Abstract

Discovering patterns from data is an important task in data mining. 
There exist techniques to find large collections of many kinds of 
patterns from data very efficiently. A collection of patterns can 
be regarded as a summary of the data. A major difficulty with 
patterns is that pattern collections summarizing the data well are 
often very large. 

In this dissertation we describe methods for summarizing pattern 
collections in order to make them also more understandable. More 
specifically, we focus on the following themes: 

Quality value simplifications. We study simplifications of pat- 
tern collections based on simplifying the quality values of the 
patterns. Especially, we study simplification by discretiza- 
tion. 

Pattern orderings. It is difficult to find a suitable trade-off be- 
tween the accuracy of the representation and its size. As a 
solution to this problem, we suggest that patterns could be 
ordered in such a way that each prefix of the pattern ordering 
gives a good summary of the whole collection. 

Pattern chains and antichains. Virtually all pattern collections 
have natural underlying partial orders. We exploit the partial 
orders over pattern collections by clustering the patterns into 
chains and antichains. 

Change profiles. We describe how patterns can be related to each 
other by comparing how their quality values change with respect
to their common neighborhoods, i.e., by comparing their
change profiles. 

Inverse pattern discovery. As the patterns are often used to 
summarize data, it is natural to ask whether the original data 
set can be deduced from the pattern collection. We study the 
computational complexity of such problems. 

Computing Reviews (1998) Categories and Subject Descriptors: 
E.4 Coding and Information Theory: Data Compaction 
and Compression 

H.2.8 Database Applications: Data Mining

I.2 Artificial Intelligence

I.2.4 Knowledge Representation Formalisms and
Methods 

General Terms: Algorithms, Theory, Experimentation 
CHAPTER 1



Introduction 



"But what kind of authentic and valuable informa- 
tion do you require?" asked Klapaucius. 

"All kinds, as long as it's true", replied the pirate. 
"You never can tell what facts may come in handy. I al- 
ready have a few hundred wells and cellars full of them, 
but there's room for twice again as much. So out with 
it; tell me everything you know, and I'll jot it down. 
But make it snappy!" 

Stanislaw Lem: The Cyberiad (1974) 

Mankind has achieved an impressive ability to store data [Rie03].
The capacity of digital data storage has doubled every nine months
for at least a decade [FU02]. Furthermore, our skills and interest
in collecting data are also remarkable [LV03].

Our ability to process the collected data is not so impressive. 
In fact, there is a real danger that we construct write-only data 
stores that cannot be exploited using current technologies [FU02].
Besides constructing data tombs that contain snapshots of our world 
for the tomb raiders of the forthcoming generations, this is not very 
useful. It can be said that we are in a data rich but information 
poor situation [HK01].

In addition to the immense amount of data being collected, the 
data is becoming increasingly complex and diverse [Fay01, SPF02]:
companies collect data about their customers to maximize their
expected profit [KRS02], scientists gather large repositories of
observations to better understand nature [HAK+02], and governments
of many countries are collecting vast amounts of data to ensure
homeland security, which has been recognized as a very important
issue due to the globalization of conflicts and terrorism [Yen04].
When several different data repositories are combined, the data con- 
cerning even only a single person can be tremendously large and 
complex. 

Due to the weakness of the current techniques to exploit large 
data repositories and the complexity of the data being collected, a 
new discipline known as data mining is emerging in the intersec- 
tion of artificial intelligence, databases and statistics. The current 
working definition of this new field is the following [HMS01]:

Data mining is the analysis of (often large) observa- 
tional data sets to find unsuspected relationships and 
to summarize the data in novel ways that are both un- 
derstandable and useful to the data owner. 

On one hand this definition is acceptable for a large variety of 
data mining scholars. On the other hand its interpretation depends 
on several imprecise concepts: the meanings of the words
'unsuspected', 'understandable' and 'useful' depend on the context.
Also the words 'relationships' and 'summarize' have a vast number of
different interpretations. This indeterminacy in general seems to be
inherent to data mining since the actual goal is in practice deter- 
mined by the task at hand. 

Despite the inherent vagueness of the definition, the field of data
mining can be elucidated by arranging the techniques into groups of
similar approaches. The techniques can be divided roughly into two
parts, namely global and local methods.

Global methods concern constructing and manipulating global 
models that describe the entire data. Global models comprise most 
of the classical statistical methods. For example, the Gaussian 
distribution function is a particularly well-known global model for 
real-valued data. The focus in the data mining research of global 
methods has been on developing and scaling up global modeling 
techniques to very large data sets. 

Local methods focus on discovering patterns from data. Patterns 
are parsimonious summaries of subsets of data [FU02]. The rule
"People who buy diapers tend to buy beer" is a classical example
of such a pattern. In contrast to the global modeling approach, pattern
discovery as a discipline in its own right is relatively new [Han02].
(The term 'discovery' has recently been criticized in the context of 
data mining to be misleading since data mining is based on scien- 
tific principles and it can be argued that science does not discover 
facts by induction but rather invents theories that are then checked 
against experience [PB02]. The term is used, however, in this
dissertation because of its established use in data mining literature.)

The global and local methods can be summarized in the following 
way. The global modeling approach views data mining as the task of 
approximating the joint probability distribution whereas the pattern 
discovery can be summarized in the slogan: data mining is the 
technology of fast counting [Man02].

The distinction between global models and local patterns is not strict.
Although a Gaussian distribution is usually considered a global
model, it can also be a pattern: each Gaussian distribution in a
mixture of Gaussians is assumed to describe only a part of the 
data. 

This work focuses on pattern discovery. There exist effective 
techniques to discover many kinds of patterns KtZDSI IMT97j . Due 
to that fact the question of how the discovered patterns could ac- 
tually be exploited is becoming increasingly important. Often the 
answer to that question is tightly coupled with the particular
application. Many problems, obstacles and characteristics, however,
are shared by different applications.

A very important application of patterns is to summarize given 
data as a collection of patterns, possibly augmented with some aux- 
iliary information such as the quality values of the patterns. Un- 
fortunately, often the size of the pattern collection that faithfully 
represents the aspects of the data considered to be relevant is very 
large. Thus, in addition to data tombs, there is a risk of construct- 
ing also pattern tombs. 

1.1 The Contributions and the Organization 

The main purpose of this dissertation is to study how to summarize 
pattern collections by exploiting the structure of the collections and 
the quality values of the patterns. The rest of the dissertation is 
organized as follows.

Chapter 2 provides an introduction to pattern discovery that is
sufficient to follow the rest of the dissertation. It contains a 
systematic derivation of a general framework for pattern dis- 
covery, a brief overview of the current state of pattern discov- 
ery and descriptions of the most important (condensed) repre- 
sentations of pattern collections. Furthermore, some technical 
challenges of pattern exploitation are briefly discussed. 

Chapter 3 concerns simplifying pattern collections by simplifying
the quality values of the patterns. The only assumption
needed about the pattern collection is that there is a quality 
value associated to each pattern. 

We illustrate the idea of constraining the quality values of the 
patterns by discretizing the frequencies of frequent itemsets. 
We examine the effect of discretizing frequencies to the accu- 
racies of association rules and propose algorithms for comput- 
ing optimal discretizations with respect to several loss func- 
tions. We show empirically that discretizations with quite 
small errors can reduce the representation of the pattern col- 
lection considerably. 

Chapter 4 focuses on trade-offs between the size of the pattern
collection and its accuracy in describing the data. The chapter
suggests ordering the patterns by their abilities to describe the
whole pattern collection with respect to a given loss function 
and an estimation method. The obtained ordering is a refining 
description of the pattern collection and it requires only a loss 
function and an estimation method. 

We show that for several pairs of loss functions and estimation
methods, the most informative k-subcollection of the patterns
can be approximated within a constant factor by the k-prefix
of the pattern ordering for all values of k simultaneously.
We illustrate the pattern orderings by refining approximations
of closed itemsets and tilings of transaction databases. We
evaluate the condensation abilities of the pattern orderings
empirically by computing refining approximations of closed
frequent itemsets. The results show that already short prefixes
of the orderings of the frequent itemsets are sufficient to
provide reasonably accurate approximations.

Chapter 5 is motivated by the fact that a pattern collection has
usually some structure apart from the quality values of the 
patterns. Virtually all pattern collections have non-trivial 
partial orders over the patterns. In this chapter we suggest 
the use of minimum chain and antichain partitions of partially 
ordered pattern collections to figure out the essence of a given 
pattern collection. 

For an arbitrary pattern collection, its chain and antichain 
partitions provide clusterings of the collection. The bene- 
fit from the chain partition can be even greater: for many 
known pattern collections, each chain in the partition can be 
described as a single pattern. The chain partitions give a 
partially negative answer to the question whether a random 
sample of the data is essentially the best one can hope. We 
evaluate empirically the ability of pattern chains to condense 
pattern collections in the case of closed frequent itemset col- 
lections. 

Chapter 6 introduces a novel approach to relate patterns in a pattern
collection to each other: patterns are considered similar
if their change profiles are similar, i.e., if their quality values 
change similarly with respect to their common neighbors in a 
given neighborhood relation. This can be seen as an attempt 
to bridge the gap between local and global descriptions of the 
data. 

A natural way of using similarities is the clustering of pat- 
terns. Unfortunately, clustering based on change profiles turns 
out to be computationally very difficult. Because of that, 
we discuss advantages and disadvantages of different heuris- 
tic approaches to cluster patterns using change profiles. Fur- 
thermore, we demonstrate that change profiles can determine 
meaningful (hierarchical) clusterings. In addition to examin- 
ing the suitability of change profiles for comparing patterns, 
we propose two algorithms for estimating the quality values 
of the patterns from their approximate change profiles. To 
see how the approximate change profiles affect the estimation 
of the quality values of the patterns, the stability of the
frequency estimates of the frequent itemsets is empirically
evaluated with respect to different kinds of noise.

Chapter 7 studies the problems of inverse pattern discovery, i.e.,
finding data that could have generated the patterns. In par- 
ticular, the main task considered in the chapter is to decide 
whether there exists a database that has the correct frequen- 
cies for a given itemset collection. This question is relevant 
in, e.g., privacy-preserving data mining, in quality evaluation 
of pattern collections, and in inductive databases. We show 
that many variants of the problem are NP-hard but some 
non-trivial special cases have polynomial-time algorithms. 

Chapter 8 concludes this dissertation.



CHAPTER 2 



Pattern Discovery 



This chapter provides an introduction to pattern discovery, one of
the two main sub-disciplines of data mining, and its central concepts,
which are used throughout this dissertation. A general framework
is derived for pattern discovery, the most important
condensed representations of pattern collections are introduced and
the purpose of patterns is briefly discussed.



2.1 The Pattern Discovery Problem 

The goal in pattern discovery is to find interesting patterns from 
given data [Han02, Man02]. The task can be defined more formally
as follows: 

Problem 2.1 (pattern discovery). Given a class $\mathcal{P}$ of patterns
and an interestingness predicate $q : \mathcal{P} \to \{0, 1\}$ for the pattern
class, find the collection

$$\mathcal{P}_q = \{p \in \mathcal{P} : q(p) = 1\}$$

of interesting patterns. Its complement $\bar{\mathcal{P}}_q = \mathcal{P} \setminus \mathcal{P}_q$ is called the
collection of uninteresting patterns in $\mathcal{P}$ with respect to $q$.

The pattern discovery problem as defined above consists of only
two parts: the collection $\mathcal{P}$ of possibly interesting patterns and the
interestingness predicate $q$.

The pattern collection $\mathcal{P}$ constitutes a priori assumptions of
which patterns could be of interest. The collection $\mathcal{P}$ is usually
not represented explicitly since its cardinality can be very large,
sometimes even infinite. For example, the collection of patterns
could consist of all regular expressions over a given alphabet $\Sigma$.
(For an introduction to regular expressions, see e.g. [HMU01].) This
collection is infinite even for the unary alphabet.

The absence of data from the definition might be a bit confus- 
ing at first. It is omitted on purpose: Often the interestingness 
predicate depends on data and the data is usually given as a pa- 
rameter for the predicate. This is not true in every case, however, 
since the interestingness (or, alternatively, the quality) of a pattern
can be determined by an expert who has specialized in some particular
data set and the interestingness predicate might be useless
for any other data set regardless of its form. For example, a company
offering industrial espionage that is specialized in investigating
power plants can be rather poor at detecting interesting patterns in
gardening data.

Defining a reasonable interestingness predicate is usually a highly 
non-trivial task: the interestingness predicate should capture most 
truly interesting patterns and only few uninteresting ones. 

Due to these difficulties, a relaxation of an interestingness predicate,
an interestingness measure

$$\phi : \mathcal{P} \to [0, 1]$$

expressing the quantitative value $\phi(p)$ of the interestingness (or the
quality) for each pattern $p \in \mathcal{P}$ is used instead of an interestingness
predicate. In this dissertation the value $\phi(p)$ of $p \in \mathcal{P}$ is called the
quality value of $p$ with respect to the interestingness measure $\phi$, or
in short: the quality of $p$. Many kinds of interestingness measures
have been studied in the literature, see e.g. [TKS02].

Example 2.1 (an interestingness measure). Let the data set
consist of names of recently born children and their ages (that
are assumed to be strictly positive), i.e., let $\mathcal{D}$ be a set of pairs
$\langle name, age \rangle \in \Sigma^* \times \mathbb{R}_+$.

An interestingness measure $\phi$ for the pattern class $\mathcal{P}_{regexp}$ consisting
of all regular expressions could be defined as follows. Let $\mathcal{D}|_p$
be the group of children whose names satisfy the regular expression
$p \in \mathcal{P}_{regexp}$. The quality of a pattern $p \in \mathcal{P}_{regexp}$ is the smallest age
of any child in $\mathcal{D}$ divided by the average age of the children whose
names belong to the regular language $p$, i.e.,

$$\phi(p, \mathcal{D}) = \frac{\min\{age : \langle name, age \rangle \in \mathcal{D}\}}{\left(\sum_{\langle name, age \rangle \in \mathcal{D}|_p} age\right) / |\{age : \langle name, age \rangle \in \mathcal{D}|_p\}|}$$

□
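The measure of Example 2.1 can be tried out with a minimal Python sketch. Several things here are assumptions made only for illustration: the function name `quality`, the sample list `children`, the use of Python's `re.fullmatch` as a stand-in for membership in a regular language, and the convention that a pattern matching no name gets quality 0 (the example leaves that case open). The average is taken over the matching children, following the verbal description.

```python
import re

def quality(pattern, data):
    """Smallest age in the data divided by the average age of the
    children whose whole name matches the regular expression."""
    matched = [age for name, age in data if re.fullmatch(pattern, name)]
    if not matched:
        return 0.0  # assumption: an unmatched pattern has quality 0
    min_age = min(age for _, age in data)
    avg_age = sum(matched) / len(matched)
    return min_age / avg_age

# Hypothetical data: (name, age) pairs with strictly positive ages.
children = [("Anna", 1.0), ("Aino", 2.0), ("Eero", 4.0)]

print(quality(r"A.*", children))  # names starting with 'A'
print(quality(r".*", children))   # all names
```

Since the average age of any group is at least the smallest age in the whole data set, the value always falls in the interval [0, 1], as an interestingness measure requires.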



There are many reasons why interestingness measures are fa- 
vored over interestingness predicates. An important reason is that it 
is often easier to suggest some degrees of interestingness for the pat- 
terns in the given collection than to partition the patterns into the 
groups of strictly interesting and uninteresting ones. In fact, using 
an interestingness measure, instead of an interestingness predicate, 
partially postpones the difficulty of fixing a suitable interestingness 
predicate, since an interestingness measure implicitly determines an 
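infinite number of interestingness predicates:

$$q_\sigma(p) = \begin{cases} 1 & \text{if } \phi(p) \geq \sigma, \\ 0 & \text{otherwise,} \end{cases}$$

one for each threshold value $\sigma$.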



In addition to these practical reasons, there are also some more 
foundational arguments that support the use of interestingness mea- 
sures instead of predicates. Namely, it can be argued that the actual 
goal in pattern discovery is not merely to find a collection of inter- 
esting patterns but to rank the patterns with respect to their quality 
values [Mie04a]. Also, due to the exploratory nature of data mining,
it might not be wise to completely discard the patterns that
seem to be uninteresting, since you never can tell what patterns
may come in handy. Instead, it could be more useful just to list the
patterns in decreasing order with respect to their quality values.
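The two uses of an interestingness measure discussed here, thresholding it into a predicate and ranking the patterns by it, can be contrasted in a short Python sketch. The helper names `threshold_predicate` and `rank` and the quality values in `qualities` are hypothetical and chosen only for illustration.

```python
def threshold_predicate(measure, sigma):
    """Turn a quality measure into the predicate 'quality >= sigma'."""
    return lambda p: 1 if measure(p) >= sigma else 0

def rank(patterns, measure):
    """List the patterns in decreasing order of their quality values."""
    return sorted(patterns, key=measure, reverse=True)

# Hypothetical quality values for three patterns.
qualities = {"AC": 0.5, "ABCD": 0.25, "AB": 0.75}
measure = qualities.get

q_half = threshold_predicate(measure, 0.5)
interesting = [p for p in qualities if q_half(p) == 1]
ranking = rank(qualities, measure)

print(interesting)
print(ranking)
```

Each choice of the threshold gives a different predicate, whereas the ranking keeps every pattern and leaves the cut-off to the reader.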

Also the interestingness predicates can be defended against the 
interestingness measures. The interestingness predicates determine 
collections of patterns whereas the interestingness measures deter- 
mine rankings (or gradings). On one hand, the interestingness pred- 
icates can be manipulated and combined by boolean connectives. 
Furthermore, the manipulations have direct correspondents in the 
pattern collections. Combining rankings corresponding to interest- 
ingness measures, on the other hand, is not so straightforward. 

Thus, the interestingness predicates and the interestingness measures
both have strong and weak points. Due to this, the majority
of pattern discovery research has focused on the combination
of interestingness measures and predicates: discovering
collections of interesting patterns augmented by their quality
values.



2.2 Frequent Itemsets and Association Rules 

The most prominent example of pattern discovery is discovering 
(or mining) frequent itemsets from transaction databases
[AIS93, Man02].

Definition 2.1 (items and itemsets). A set of possible items is
denoted by $\mathcal{I}$. An itemset $X$ is a subset of $\mathcal{I}$. For brevity, an itemset
$X$ consisting of items $A_1, A_2, \ldots, A_{|X|}$ can be written $A_1 A_2 \ldots A_{|X|}$
instead of $\{A_1, A_2, \ldots, A_{|X|}\}$.

Definition 2.2 (transactions and transaction databases). A
transaction $t$ is a pair $\langle i, X \rangle$ where $i$ is a transaction identifier (tid)
and $X$ is an itemset. The number of items in the itemset $X$ of a
transaction $t = \langle i, X \rangle$ is denoted by $|t|$.

A transaction database $\mathcal{D}$ is a set of transactions. Each transaction
in $\mathcal{D}$ has a unique transaction identifier. The number of
transactions in the transaction database $\mathcal{D}$ is denoted by $|\mathcal{D}|$ and
the set of transaction identifiers in $\mathcal{D}$ by $tid(\mathcal{D}) = \{i : \langle i, X \rangle \in \mathcal{D}\}$.
In the context of this dissertation it is assumed, without loss of
generality, that $tid(\mathcal{D}) = \{1, \ldots, |\mathcal{D}|\}$.

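The set of occurrences of an itemset $X$ in $\mathcal{D}$ is the set

$$occ(X, \mathcal{D}) = \{i : \langle i, X \rangle \in \mathcal{D}\}$$

of transaction identifiers of the transactions $\langle i, X \rangle \in \mathcal{D}$. The number
of occurrences of $X$ in $\mathcal{D}$ is denoted by $count(X, \mathcal{D}) = |occ(X, \mathcal{D})|$.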

Another important aspect for frequent itemsets is the definition 
of what it means that an itemset is frequent with respect to a 
transaction database. 

Definition 2.3 (covers, supports and frequencies). A transaction
$t = \langle i, Y \rangle$ in a transaction database $\mathcal{D}$ is said to cover or
support an itemset $X$ if $X \subseteq Y$. The cover of an itemset $X$ in $\mathcal{D}$ is
the set

$$cover(X, \mathcal{D}) = \{i : \langle i, Y \rangle \in \mathcal{D}, X \subseteq Y\}$$

of transaction identifiers of the transactions in $\mathcal{D}$ that cover $X$. The
support of $X$ in $\mathcal{D}$ is denoted by $supp(X, \mathcal{D})$ and it is equal to the
cardinality of the cover of $X$ in $\mathcal{D}$, i.e.,

$$supp(X, \mathcal{D}) = |cover(X, \mathcal{D})|.$$

The frequency of $X$ in $\mathcal{D}$ is its support divided by the number of
transactions in $\mathcal{D}$, i.e.,

$$fr(X, \mathcal{D}) = \frac{supp(X, \mathcal{D})}{|\mathcal{D}|}.$$

The database $\mathcal{D}$ can be omitted from the parameters of these
functions when $\mathcal{D}$ is not known or needed. If there are several
itemset collections $\mathcal{F}_1, \ldots, \mathcal{F}_m$ with different covers, supports or
frequencies, we denote the cover, the support and the frequency of
an itemset $X$ in the collection $\mathcal{F}_i$ ($1 \leq i \leq m$) by $cover(X, \mathcal{F}_i)$,
$supp(X, \mathcal{F}_i)$ and $fr(X, \mathcal{F}_i)$, respectively.
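The cover, the support and the frequency translate almost literally into code. The following Python sketch assumes, purely for illustration, that a transaction database is stored as a dictionary from transaction identifiers to sets of items; the function names mirror the notation of the text and the small database `db` is hypothetical.

```python
def cover(X, db):
    """Identifiers of the transactions whose itemset contains X."""
    return {tid for tid, items in db.items() if X <= items}

def supp(X, db):
    """Support: the cardinality of the cover."""
    return len(cover(X, db))

def fr(X, db):
    """Frequency: the support divided by the number of transactions."""
    return supp(X, db) / len(db)

# Hypothetical transaction database: tid -> itemset.
db = {
    1: {"bread", "milk"},
    2: {"bread"},
    3: {"milk", "beer"},
}

print(cover({"bread"}, db))
print(supp({"bread"}, db))
print(fr(set(), db))
```

The empty itemset is covered by every transaction, so its frequency is one in any non-empty database.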

Based on these definitions, the frequent itemset mining problem 
can be formulated as follows: 

Problem 2.2 (frequent itemset mining [AIS93]). Given a
transaction database $\mathcal{D}$ and a minimum frequency threshold
$\sigma \in (0, 1]$, find all $\sigma$-frequent itemsets in $\mathcal{D}$, i.e., all itemsets $X$ such that
$fr(X, \mathcal{D}) \geq \sigma$. The collection of $\sigma$-frequent itemsets is denoted by
$\mathcal{F}(\sigma, \mathcal{D})$.

Example 2.2 (frequent itemsets). Let the transaction database
$\mathcal{D}$ consist of transactions $\langle 1, ABC \rangle$, $\langle 2, AB \rangle$, $\langle 3, ABCD \rangle$ and
$\langle 4, BC \rangle$. Then the frequencies of itemsets in $\mathcal{D}$ are as shown in
Table 2.1. For example, the collection $\mathcal{F}(2/4, \mathcal{D})$ of 2/4-frequent
itemsets in $\mathcal{D}$ is $\{\emptyset, A, B, C, AB, AC, BC, ABC\}$.

□
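The example can be checked by brute force. The Python sketch below enumerates every subset of the items occurring in the four transactions and keeps those whose frequency is at least 2/4. The representation of itemsets as strings and the function names are assumptions made for brevity, and exhaustive enumeration is feasible only for toy databases such as this one.

```python
from fractions import Fraction
from itertools import combinations

# The four transactions of the example: tid -> items.
db = {1: "ABC", 2: "AB", 3: "ABCD", 4: "BC"}

def fr(X, db):
    """Exact frequency of the itemset X in db."""
    hits = sum(1 for items in db.values() if set(X) <= set(items))
    return Fraction(hits, len(db))

def frequent_itemsets(db, sigma):
    """All itemsets with frequency at least sigma, by exhaustive search."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(len(items) + 1):
        for X in combinations(items, k):
            f = fr(X, db)
            if f >= sigma:
                result["".join(X)] = f  # "" denotes the empty itemset
    return result

frequent = frequent_itemsets(db, Fraction(2, 4))
for X, f in frequent.items():
    print(repr(X), f)
```

Using `Fraction` keeps the frequencies exact, so the comparison against the threshold 2/4 does not depend on floating-point rounding.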

Probably the best-known example of frequent itemset mining
tasks is market basket analysis. In that case the items are
products available for sale. Each transaction consists of a trans- 
action identifier and a subset of the products that typically corre- 
sponds to items bought in a single purchase, i.e., the transactions 
Table 2.1: Itemsets and their frequencies in $\mathcal{D}$.

X            fr(X, D)
$\emptyset$  1
A            3/4
B            1
C            3/4
AB           3/4
AC           2/4
BC           3/4
ABC          2/4
ABCD         1/4

are market baskets. (Alternatively each transaction can correspond 
to all items bought by a single customer, possibly as several shop- 
ping events.) Thus, frequent itemsets are the sets of products that 
people tend to buy together as a single purchase event. 

The frequent itemsets are useful also in text mining. An impor- 
tant representation of text documents is the so-called bag-of-words 
model where a document is represented as a set of stemmed words 
occurring in the document. Thus, items correspond to the stemmed 
words and each document is a transaction. The frequent itemsets 
are the sets of stemmed words that occur frequently together in the 
documents of the document collection. 

Web mining is yet another application of frequent itemsets. There
each item could be, for example, a link pointing at (from) a certain
web page and each transaction could then correspond to the links
pointing from (at) a web page. Then the frequent itemsets correspond
to groups of web pages that are referred to concurrently by
(that refer concurrently to) the same web pages.

2.2.1 Real Transaction Databases 

The purpose of data mining is to analyze data. Without data there 
is not much data mining. Also the methods described in this dis- 
sertation are demonstrated using real data and the patterns discov- 
ered from the data. More specifically, in this dissertation, we use 
the (frequent) itemsets mined from three transaction databases as 
running examples of pattern collections (of interesting patterns). 
The two main reasons for this are that many data analysis tasks
can be modeled as frequent itemset mining and frequent itemset 
mining has been studied very actively for more than a decade. 

We use a course completion database of the computer science 
students at the University of Helsinki to illustrate the methods 
described in this dissertation. Each transaction in that database 
corresponds to a student and items in a transaction correspond to 
the courses the student has passed. As data cleaning, we removed 
from the database the transactions corresponding to students with- 
out any passed courses in computer science. The cleaned database 
consists of 2405 transactions corresponding to students and 5021 
different items corresponding to courses. 

The number of students that have passed a certain number of
courses and the number of students that have passed each course are
shown in Figure 2.1. The courses that at least a 0.20-fraction of
the students in the course completion database have passed (i.e.,
the 34 most popular courses) are shown in Table 2.2. The 0.20-
frequent itemsets in the course completion database are illustrated
by Example 2.3.

Example 2.3 (0.20-frequent itemsets in the course completion
database). Let us denote by $\mathcal{F}(\sigma, \mathcal{D})[i]$ the $\sigma$-frequent itemsets
in $\mathcal{D}$ with cardinality $i$. Then the cardinality distributions of
the 0.20-frequent itemsets in the course completion database and
the most frequent itemsets of each cardinality in that collection are
as shown in Table 2.3.

□

The condensation approaches described in the later chapters are
quantitatively evaluated using two data sets from UCI KDD Repository
(http://kdd.ics.uci.edu/): Internet Usage data consisting
of 10104 transactions and 10674 items, and IPUMS Census data
consisting of 88443 transactions and 39954 items.

The transaction database Internet Usage is an example of dense
transaction databases and the transaction database IPUMS Census
is a sparse one: in the Internet Usage database only few $\sigma$-frequent
itemsets are contained in exactly the same transactions whereas in
the IPUMS Census database many $\sigma$-frequent itemsets are contained
in exactly the same transactions. (This holds for many different
values of $\sigma$.) This means also that most of the frequent
Figure 2.1: The number of transactions of different cardinalities
(top) and the item counts in the course completion database (bottom).
Table 2.2: The courses in the course completion database that at
least a 0.20-fraction of the students in the database have passed. The
columns are the rank of the course with respect to the support, the
number of students that have passed the course, the official course
code and the name of the course, respectively.

rank  count  code    name
0     2076   50001   Orientation Studies
1     1587   99270   Reading Comprehension in English
2     1498   58160   Programming Project
3     1210   58123   Computer Organization
4     1081   58128   Introduction to UNIX
5     1071   58125   Information Systems
6     1069   58131   Data Structures
7     1060   58161   Data Structures Project
8     931    99280   English Oral Test
9     920    58127   Programming in C
10    856    58122   Programming (Pascal)
11    803    99291   Oral and Written Skills in the Second Official Language, Swedish
12    763    58162   Information Systems Project
13    760    58132   Concurrent Systems
14    755    58110   Scientific Writing
15    748    58038   Database Systems I
16    744    57031   Approbatur in Mathematics I
17    733    581259  Software Engineering
18    709    57019   Discrete Mathematics I
19    697    581330  Models for Programming and Computing
20    695    50028   Maturity Test in Finnish
21    677    581327  Introduction to Application Design
22    655    581326  Programming in Java
23    651    581328  Introduction to Databases
24    650    581325  Introduction to Programming
25    649    581256  Teacher Tutoring
26    628    57013   Linear Algebra I
27    586    57274   Logic I
28    585    580212  Introduction to Computing
29    568    581324  Introduction to the Use of Computers
30    567    58069   Data Communications
31    564    581260  Software Engineering Project
32    520    57032   Approbatur in Mathematics II
33    519    581329  Database Application Project




Table 2.3: The number of the 0.20-frequent itemsets of each cardinality
in the course completion database, the most frequent itemsets
of each cardinality and their supports.

i  $|\mathcal{F}(\sigma, \mathcal{D})[i]|$  the largest $X \in \mathcal{F}(\sigma, \mathcal{D})[i]$  $supp(X, \mathcal{D})$
0  1    $\emptyset$             2405
1  34   {0}                     2076
2  188  {0,1}                   1345
3  474  {2,3,5}                 960
4  717  {0,2,3,5}               849
5  626  {0,2,3,4,5}             681
6  299  {0,2,3,5,7,12}          588
7  72   {2,3,5,7,12,13,15}      547
8  8    {0,2,3,5,7,12,13,15}    512


itemsets in Internet Usage are closed whereas most of the frequent 
itemsets in IPUMS Census are not. (See Definition 2.10 for more
details on itemsets being closed.) 

2.2.2 Computing Frequent Itemsets 

The frequent itemset mining problem has been studied extensively 
for more than a decade and several efficient search strategies have 
been developed, see e.g. [AIS93, AMS+96, RTToHl, HPYM04, Zak00].
Most of the techniques follow the generate-and-test approach: the 
collection of frequent itemsets is initialized to consist of the empty 
itemset with support equal to the number of transactions in the 
database. (This is due to the fact that the empty itemset is con- 
tained in each transaction which means also that its frequency is 
one.) Then the collections of itemsets that might be frequent are 
generated and tested repeatedly until it is decided that there are 
no more itemsets that are not tested but could still be frequent. 
The most important property of frequent itemsets for search space 
pruning and candidate generation is the anti-monotonicity of the 
supports with respect to the set inclusion relation. 

Observation 2.1. If $X \subseteq Y$, then $supp(X, \mathcal{D}) \geq supp(Y, \mathcal{D})$.
Thus, all subitemsets of frequent itemsets are frequent and all
superitemsets of infrequent itemsets are infrequent.
This observation is largely responsible for the computational
feasibility of the famous frequent itemset mining algorithm Apriori
[AMS+96] in practice. It, or some variant of it, is used extensively
in virtually all frequent itemset mining methods.
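The generate-and-test approach with anti-monotonicity pruning can be illustrated by a levelwise search in the spirit of Apriori. The Python sketch below is a simplified illustration under stated assumptions, not the algorithm of the cited papers verbatim: it takes an absolute minimum support `min_supp` (assumed to be at most the number of transactions), counts supports by scanning all transactions, and the name `levelwise_frequent_itemsets` is hypothetical.

```python
from itertools import combinations

def levelwise_frequent_itemsets(transactions, min_supp):
    """Levelwise search for itemsets with support >= min_supp."""
    transactions = [frozenset(t) for t in transactions]

    def supp(X):
        return sum(1 for t in transactions if X <= t)

    # The empty itemset is contained in every transaction.
    frequent = {frozenset(): len(transactions)}
    items = sorted(set().union(*transactions))
    level = [frozenset([a]) for a in items]  # candidates of cardinality 1
    while level:
        # Test step: keep the candidates that are frequent.
        survivors = {X: supp(X) for X in level}
        survivors = {X: s for X, s in survivors.items() if s >= min_supp}
        frequent.update(survivors)
        # Generate step: a candidate one item larger is kept only if all
        # of its subsets on the current level are frequent.
        candidates = set()
        for X, Y in combinations(survivors, 2):
            Z = X | Y
            if len(Z) == len(X) + 1 and all(Z - {a} in survivors for a in Z):
                candidates.add(Z)
        level = list(candidates)
    return frequent

result = levelwise_frequent_itemsets(["ABC", "AB", "ABCD", "BC"], 2)
for X, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(X)) or "{}", s)
```

Because an infrequent itemset cannot have a frequent superset, the item D is discarded on the first level and no itemset containing it is ever counted.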



2.2.3 Association Rules 

The itemsets that are frequent in the database are themselves summaries
of the database but they can also be considered as by-products of
finding association rules.

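Definition 2.4 (Association rules). Let $\mathcal{D}$ be a transaction database.
An association rule is an implication of form $X \Rightarrow Y$ such
that $X, Y \subseteq \mathcal{I}$. The itemset $X$ is called the body (or the antecedent)
of the rule and the itemset $Y$ is known as the head (or the consequent)
of the rule.

The accuracy of the association rule $X \Rightarrow Y$ is denoted by

$$acc(X \Rightarrow Y, \mathcal{D}) = \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})},$$

its support $supp(X \Rightarrow Y, \mathcal{D})$ is equal to $supp(X \cup Y, \mathcal{D})$ and the
frequency of the association rule $X \Rightarrow Y$ is

$$fr(X \Rightarrow Y, \mathcal{D}) = \frac{supp(X \Rightarrow Y, \mathcal{D})}{|\mathcal{D}|}.$$

An association rule is called simple if the head is a singleton.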

To avoid generating redundant association rules, it is usually
assumed that the body $X$ and the head $Y$ of the rule
$X \Rightarrow Y$ are disjoint. Instead of all association rules,
typically only the $\sigma$-frequent association rules, i.e., the
association rules with frequency at least $\sigma$, are computed. The
intuition behind this restriction is that the support of the
association rule immediately tells how many transactions in the
database the association rule concerns. Another reason for
concentrating only on $\sigma$-frequent association rules is that they
can be computed from the $\sigma$-frequent itemsets by a
straightforward algorithm (Algorithm 2.1) [AIS93].

Algorithm 2.1 Association rule mining.

Input: A collection $\mathcal{F}(\sigma, \mathcal{D})$ of
  $\sigma$-frequent itemsets in a transaction database $\mathcal{D}$.
Output: The collection $\mathcal{R}$ of association rules over the
  collection $\mathcal{F}(\sigma, \mathcal{D})$.

function Association-Rules($\mathcal{F}(\sigma, \mathcal{D})$)
  $\mathcal{R} \leftarrow \emptyset$
  for all $Y \in \mathcal{F}(\sigma, \mathcal{D})$ do
    for all $X \subset Y$ do
      $\mathcal{R} \leftarrow \mathcal{R} \cup \{X \Rightarrow Y \setminus X\}$
    end for
  end for
  return $\mathcal{R}$
end function
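A minimal Python sketch of Algorithm 2.1 is given below. It assumes
that the frequent itemsets are supplied as a dictionary from frozensets
to supports, that the collection is downward closed (so every body has
a known support), and that the accuracy of a rule is the ratio
supp(X ∪ Y)/supp(X); the function name and the tuple layout of a rule
are illustrative choices only.

```python
from itertools import combinations


def association_rules(frequent):
    """Enumerate rules X => Y \\ X for every frequent Y and proper subset X.

    frequent maps each frequent itemset (a frozenset) to its support and
    is assumed to be downward closed.
    Returns tuples (body, head, support, accuracy).
    """
    rules = []
    for y, supp_y in frequent.items():
        for r in range(len(y)):  # sizes of the proper subsets of y
            for x in map(frozenset, combinations(sorted(y), r)):
                accuracy = supp_y / frequent[x]
                rules.append((x, y - x, supp_y, accuracy))
    return rules


# Hypothetical input: supports of the empty itemset, A, B and AB.
frequent = {frozenset(): 4, frozenset("A"): 3,
            frozenset("B"): 4, frozenset("AB"): 3}
rules = association_rules(frequent)
for body, head, supp, acc in rules:
    print(sorted(body), "=>", sorted(head), supp, round(acc, 2))
```

Because the body ranges over all proper subsets of each frequent
itemset, the number of rules grows exponentially in the size of the
itemsets, which is one motivation for restricting attention to simple
rules.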



2.3 From Frequent Itemsets to Interesting 
Patterns 

The definition of frequent itemsets readily generalizes to arbitrary
pattern collections $\mathcal{P}$ and databases $\mathcal{D}$ such that
the frequency of a pattern $p \in \mathcal{P}$ in the database
$\mathcal{D}$ can be determined. Association rules can also be defined
for a pattern collection $\mathcal{P}$ if there is a suitable partial
order $\preceq$ over the collection.

Definition 2.5 (partial order). A partial order $\preceq$ is a
transitive, antisymmetric and reflexive binary relation, i.e., a
relation $\preceq \subseteq \mathcal{P} \times \mathcal{P}$ such that
$p \preceq p' \wedge p' \preceq p'' \Rightarrow p \preceq p''$,
$p \preceq p' \wedge p' \preceq p \Rightarrow p = p'$ and
$p \preceq p$ for all $p, p', p'' \in \mathcal{P}$. (Note that
$p \preceq p'$ is equivalent to $(p, p') \in \preceq$.) We use the
shorthand $p \prec p'$ when $p \preceq p'$ but $p' \not\preceq p$.

Elements $p, p' \in \mathcal{P}$ are called comparable with respect to
the partial order $\preceq$ if and only if $p \preceq p'$ or
$p' \preceq p$. If the elements are not comparable, then they are
incomparable. A partial order is a total order in $\mathcal{P}$ if and
only if all $p, p' \in \mathcal{P}$ are comparable.

Association rules can be defined over the pattern collection
$\mathcal{P}$ and the partial order $\preceq$ over $\mathcal{P}$ if
$p \preceq p'$ implies
$\mathit{fr}(p, \mathcal{D}) \geq \mathit{fr}(p', \mathcal{D})$
for all $p, p' \in \mathcal{P}$. Itemsets are a special case of this:
one such partial order $\preceq$ over the collection of all itemsets
$X \subseteq \mathcal{I}$ is naturally defined by the set inclusion
relation

$$X \preceq Y \iff X \subseteq Y$$

holding for all $X, Y \subseteq \mathcal{I}$. Then, by the definition
of the frequency of itemsets (given earlier in this chapter),
$X \subseteq Y$ implies
$\mathit{fr}(X, \mathcal{D}) \geq \mathit{fr}(Y, \mathcal{D})$ for
all $X, Y \subseteq \mathcal{I}$.

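Definition 2.6 (frequent patterns and their association rules).
Let $\mathcal{P}$ be a pattern collection, $\mathcal{D}$ a database,
$\sigma$ a positive value in the interval $[0, 1]$, and for each
pattern $p \in \mathcal{P}$, let $\mathit{fr}(p, \mathcal{D})$ denote
the frequency of $p$ in $\mathcal{D}$. The collection
$\mathcal{F}(\sigma, \mathcal{D})$ of $\sigma$-frequent patterns
consists of the patterns $p \in \mathcal{P}$ such that
$\mathit{fr}(p, \mathcal{D}) \geq \sigma$.

Let $\preceq$ be a partial order over the pattern collection
$\mathcal{P}$ and let $p \preceq p'$ imply
$\mathit{fr}(p, \mathcal{D}) \geq \mathit{fr}(p', \mathcal{D})$ for
all $p, p' \in \mathcal{P}$. Then an association rule is a rule
$p \Rightarrow p'$ where $p, p' \in \mathcal{P}$ and $p \prec p'$. The
accuracy of an association rule $p \Rightarrow p'$ is

$$\mathit{acc}(p \Rightarrow p', \mathcal{D}) = \frac{\mathit{fr}(p', \mathcal{D})}{\mathit{fr}(p, \mathcal{D})}.$$

The association rules can also be generalized to incomparable
patterns $p, p' \in \mathcal{P}$ by defining

$$\mathit{acc}(p \Rightarrow p', \mathcal{D}) = \frac{\mathit{fr}(p'', \mathcal{D})}{\mathit{fr}(p, \mathcal{D})}$$

where $p''$ is a pattern in $\mathcal{P}$ such that
$p, p' \preceq p''$ and
$\mathit{fr}(p'', \mathcal{D}) \geq \mathit{fr}(p''', \mathcal{D})$
for all $p''' \in \mathcal{P}$ such that $p, p' \preceq p'''$.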

Example 2.4 (frequent substrings and association rules).
Let $s$ be a string over an alphabet $\Sigma$ and let the frequency of
a string $p = p_1 \ldots p_{|p|} \in \Sigma^*$ in $s$ be the number of
its occurrences in $s$ divided by the length of $s$, i.e.,

$$\mathit{fr}(p, s) = \frac{\left|\{ i \in \{1, \ldots, |s| - |p| + 1\} : p = s_i \ldots s_{i+|p|-1} \}\right|}{|s|}.$$

Furthermore, let the partial order $\preceq$ over the strings in
$\Sigma^*$ be the substring relation, i.e.,

$$s \preceq t \iff \exists i \in \{0, \ldots, |t| - |s|\} : s = t_{i+1} \ldots t_{i+|s|}$$

for all $s, t \in \Sigma^*$. As $p \preceq p'$ then implies
$\mathit{fr}(p, s) \geq \mathit{fr}(p', s)$, the association rules can
be defined for substrings.

The frequencies of all substrings of $s$ can be computed in time $O(|s|)$
by constructing a suffix tree or a suffix array of s. (For details on 
linear-time suffix tree and array constructions, see e.g. jFCFMOOl 

IHElZllESml.) □ 

The previously outlined search strategies to find $\sigma$-frequent itemsets
and association rules have been adapted to many kinds of pat-
terns such as sequences |WH()4[ IZaEn] . episodes CGOSb', -GASUai 
LMTV97', trees ;XYLD().'-i[ IZak()2'. graphs 'lW M()8tiKKnit.WWS+n2l 
IYHn2 and queries |DT()1[ I^VdB02. MS03 . 

The interestingness predicate obtained by a minimum frequency
threshold determines a downward closed pattern collection for many
kinds of patterns.

Definition 2.7 (downward closed pattern collections). A pattern
collection $\mathcal{P}$ is downward closed with respect to a partial
order $\preceq$ and an interestingness predicate $q$ if and only if
$p \in \mathcal{P}_q$ implies that $p' \in \mathcal{P}_q$ for all
$p' \preceq p$.

Many of the pattern discovery techniques are adaptations of the
general levelwise search strategy for downward closed collections
of interesting patterns [MT97]. The search procedure repeatedly
evaluates all patterns all of whose subpatterns are known to be
interesting. The procedure is described by Algorithm 2.2 (which is
an adaptation from [MT97]).

Algorithm 2.2 can be modified in such a way that the requirement
of having a downward closed pattern collection can be relaxed.
Specifically, it is sufficient to require that the collection of
potentially interesting patterns that has to be evaluated in the levelwise
search is downward closed in the sense that there is a way to ne- 
glect other patterns in the collection. (For an example, see subsec- 
tion 0X1) 
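The levelwise strategy described above can be sketched for an
arbitrary finite pattern collection as follows. This is only an
illustration under stated assumptions: the pattern collection is given
explicitly as a list, precedes(a, b) is a hypothetical callable that
decides whether a is a proper subpattern of b, and the example uses
itemsets ordered by set inclusion with a support threshold of two.

```python
from itertools import combinations


def levelwise(patterns, precedes, q):
    """Return the interesting patterns, evaluating q level by level.

    patterns: finite iterable of hashable patterns
    precedes(a, b): True iff a is a proper subpattern of b
    q(p): interestingness predicate; if a is a subpattern of b,
          q(a) >= q(b) is assumed to hold
    """
    patterns = list(patterns)
    interesting = set()
    remaining = set(patterns)
    while True:
        # Minimal still potentially interesting patterns: all of their
        # proper subpatterns are already known to be interesting.
        level = {p for p in remaining
                 if all(s in interesting
                        for s in patterns if precedes(s, p))}
        if not level:
            return interesting
        interesting |= {p for p in level if q(p)}
        remaining -= level


# Hypothetical example: itemsets over {A, B, C, D}, support threshold 2.
database = [frozenset(t) for t in ["ABC", "AB", "ABCD", "BC"]]
all_itemsets = [frozenset(c) for r in range(5)
                for c in combinations("ABCD", r)]
evaluated = []


def is_frequent(x):
    evaluated.append(x)  # record which patterns were actually tested
    return sum(1 for t in database if x <= t) >= 2


found = levelwise(all_itemsets, lambda a, b: a < b, is_frequent)
print(len(found), "interesting patterns,", len(evaluated), "evaluations")
```

The list evaluated shows the benefit of the strategy: patterns with an
uninteresting subpattern are never passed to the predicate.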

2.4 Condensed Representations of Pattern 
Collections 

Algorithm 2.2 The levelwise algorithm for discovering interesting
patterns.

Input: A pattern collection $\mathcal{P}$, a partial order $\preceq$
  over $\mathcal{P}$ and an interestingness predicate
  $q : \mathcal{P} \to \{0, 1\}$ such that $p \preceq p'$ implies
  $q(p) \geq q(p')$ for all $p, p' \in \mathcal{P}$.
Output: The collection $\mathcal{P}_q$ of interesting patterns in
  $\mathcal{P}$.

function Levelwise($\mathcal{P}$, $\preceq$, $q$)
  $\mathcal{P}_q \leftarrow \emptyset$    ▷ No pattern is known to be interesting.
  $\mathcal{P}' \leftarrow \mathcal{P}$    ▷ All patterns are potentially interesting.
  repeat    ▷ Find the minimal still potentially interesting
            patterns and check whether they are interesting.
    $\mathcal{K} \leftarrow \{p \in \mathcal{P}' : p' \in \mathcal{P}, p' \prec p \Rightarrow p' \in \mathcal{P}_q\}$
    $\mathcal{P}_q \leftarrow \mathcal{P}_q \cup \{p \in \mathcal{K} : q(p) = 1\}$
    $\mathcal{P}' \leftarrow \mathcal{P}' \setminus \mathcal{K}$
  until $\mathcal{K} = \emptyset$
  return $\mathcal{P}_q$
end function

A major difficulty in pattern discovery is that the pattern
collections tend to be too large to understand. Fortunately, the
pattern collections often contain redundant information and many
patterns can be inferred from the other patterns. That is, the pattern
collection can be described by its subcollection of irredundant
patterns. The irredundancy of a pattern does not always depend only on
the pattern collection and the interestingness predicate but also on
the other irredundant patterns and the method for inferring all
patterns in the collection from the irredundant ones.

In the pattern discovery literature such collections of irredundant
patterns are known as condensed (or concise) representations of
pattern collections [CG03a], although the condensed representations
in the context of data mining were introduced in a slightly
more general sense as small representations of data that are
accurate enough with respect to a given class of queries [MT96].



2.4.1 Maximal and Minimal Patterns 

Sometimes it is sufficient, for representing the pattern collection, to
store only the maximal patterns in the collection [GKM+03].

Definition 2.8 (maximal patterns). A pattern $p \in \mathcal{P}$ is
maximal in the collection $\mathcal{P}$ with respect to the partial
order $\prec$ if and only if $p \not\prec p'$ for all
$p' \in \mathcal{P}$. The collection of maximal patterns in
$\mathcal{P}$ is denoted by $\mathrm{Max}(\mathcal{P}, \preceq)$.

It can be shown that the maximal interesting patterns in the
collection determine the whole collection of interesting patterns if
the interesting patterns form a downward closed pattern collection.

Proposition 2.1. The collection $\mathrm{Max}(\mathcal{P}_q, \preceq)$
of the maximal interesting patterns determines the collection
$\mathcal{P}_q$ of interesting patterns if and only if $\mathcal{P}_q$
is downward closed.

Proof. If the collection $\mathcal{P}_q$ is downward closed, then by
the definition of maximality, for each pattern $p \in \mathcal{P}_q$
there is a maximal pattern $p' \in \mathcal{P}_q$ such that
$p \preceq p'$. Furthermore, for each maximal pattern
$p' \in \mathcal{P}_q$ it holds that
$p \preceq p' \Rightarrow p \in \mathcal{P}_q$ if $\mathcal{P}_q$ is
downward closed.

If the collection $\mathcal{P}_q$ is not downward closed, then there
is a non-maximal pattern $p$ such that $p \notin \mathcal{P}_q$ but
$p \prec p'$ for some $p' \in \mathcal{P}_q$. The maximal patterns in
$\mathcal{P}_q$ are not sufficient to point out that pattern. □

The maximal patterns in the collection of $\sigma$-frequent itemsets,
i.e., the maximal $\sigma$-frequent itemsets in $\mathcal{D}$, are
denoted by $\mathcal{FM}(\sigma, \mathcal{D})$. Representing a
downward closed collection of patterns by the maximal patterns in the
collection can reduce the space consumption drastically. For example,
the number $|\mathcal{FM}(\sigma, \mathcal{D})|$ of maximal frequent
itemsets can be exponentially smaller than the number
$|\mathcal{F}(\sigma, \mathcal{D})|$ of all frequent itemsets.

Example 2.5 (the number of $\sigma$-frequent itemsets versus the
number of maximal $\sigma$-frequent itemsets). Let us consider a
transaction database $\mathcal{D}$ consisting only of one tuple
$(1, \mathcal{I})$. For this database and all possible minimum
frequency thresholds $\sigma \in [0, 1]$ we have
$|\mathcal{FM}(\sigma, \mathcal{D})| = 1$ and
$|\mathcal{F}(\sigma, \mathcal{D})| = 2^{|\mathcal{I}|}$. □

Example 2.6 (maximal 0.20-frequent itemsets in the course
completion database). Let us denote the collection of the maximal
$\sigma$-frequent itemsets in $\mathcal{D}$ with cardinality $i$ by
$\mathcal{FM}(\sigma, \mathcal{D})[i]$. Then the cardinality
distributions of the maximal 0.20-frequent itemsets in the course
completion database (see Subsection 2.2.1) and the most frequent
itemsets of each cardinality are as shown in Table 2.4.

Table 2.4: The number of the maximal 0.20-frequent itemsets of
each cardinality in the course completion database, the most
frequent itemsets of each cardinality and their supports.

  $i$   $|\mathcal{FM}(\sigma,\mathcal{D})[i]|$   most frequent $X \in \mathcal{FM}(\sigma,\mathcal{D})[i]$   $\mathit{supp}(X, \mathcal{D})$
  1      1     {33}                      519
  2     21     {2,26}                    547
  3     41     {0,2,19}                  529
  4     58     {0,2,7,17}                553
  5     38     {2,3,5,9,10}              511
  6     66     {0,1,2,3,4,5}             550
  7     20     {0,2,3,5,7,14,20}         508
  8      8     {0,2,3,5,7,12,13,15}      512

□

Due to the potential reduction in the number of itemsets that need
to be found, several search strategies for finding only the maximal fre-
quent itemsets ha ve been developed |BCG01..BGKM02..B,T98..GZ0l1 
[n703llGKM+0'3l[HTT03l . 

It is not clear, however, whether the maximal interesting pat- 
terns are the most concise subcollection of patterns to represent 
the interesting patterns. The collection could be represented also 
by the minimal uninteresting patterns. 

Definition 2.9 (minimal patterns). A pattern $p \in \mathcal{P}$ is
minimal in the collection $\mathcal{P}$ with respect to the partial
order $\prec$ if and only if $p' \not\prec p$ for all
$p' \in \mathcal{P}$. The collection of minimal patterns in
$\mathcal{P}$ is denoted by $\mathrm{Min}(\mathcal{P}, \preceq)$.

As in the case of the maximal interesting patterns, it is easy 
to see that the minimal uninteresting patterns uniquely determine 
the collection of the interesting patterns if the pattern collection is 
downward closed. 

The collection of minimal $\sigma$-infrequent itemsets in
$\mathcal{D}$ is denoted by $\mathcal{IM}(\sigma, \mathcal{D})$. It is
much more difficult to relate the number of minimal uninteresting
patterns to the number of interesting patterns, even when the
collection of interesting patterns is downward closed. In fact, for a
downward closed collection $\mathcal{P}_q$ of interesting patterns,
the number
$|\mathrm{Min}(\mathcal{P} \setminus \mathcal{P}_q, \preceq)|$ of
minimal uninteresting patterns cannot be bounded very well in general,
either from above or from below, by the number $|\mathcal{P}_q|$ of
interesting patterns and the number
$|\mathrm{Max}(\mathcal{P}_q, \preceq)|$ of maximal interesting
patterns.

Bounding the number of the minimal infrequent itemsets by the
number of frequent itemsets is also slightly more complex than
bounding the number of maximal frequent itemsets.

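Example 2.7 (the number of $\sigma$-frequent itemsets versus
the number of minimal $\sigma$-infrequent itemsets). The number
$|\mathcal{IM}(\sigma, \mathcal{D})|$ of minimal infrequent itemsets
can be $|\mathcal{I}|$ times larger than the number
$|\mathcal{F}(\sigma, \mathcal{D})|$ of frequent itemsets.

Namely, let the transaction database consist of the transaction
$(1, \emptyset)$. Then
$\mathcal{F}(\sigma, \mathcal{D}) = \{\emptyset\}$ but
$\mathcal{IM}(\sigma, \mathcal{D}) = \{\{A\} : A \in \mathcal{I}\}$.
This is also the worst case since each frequent itemset
$X \in \mathcal{F}(\sigma, \mathcal{D})$ can have at most
$|\mathcal{I}|$ superitemsets in $\mathcal{IM}(\sigma, \mathcal{D})$.

If the collection $\mathcal{IM}(\sigma, \mathcal{D})$ is empty, then
$|\mathcal{F}(\sigma, \mathcal{D})| > c\,|\mathcal{IM}(\sigma, \mathcal{D})|$
for all values $c \in \mathbb{R}$. Otherwise, let the transaction
database $\mathcal{D}$ consist of one transaction with itemset
$\mathcal{I} \setminus \{A\}$ for each $A \in \mathcal{I}$ and let
$\sigma = 1/|\mathcal{D}|$. Then
$|\mathcal{IM}(\sigma, \mathcal{D})|$ is exponentially smaller than
$|\mathcal{F}(\sigma, \mathcal{D})|$. □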

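It is known that the number $|\mathcal{FM}(\sigma, \mathcal{D})|$ of
maximal itemsets can be bounded from above by
$(|\mathcal{I}| - \sigma|\mathcal{D}| + 1)\,|\mathcal{IM}(\sigma, \mathcal{D})|$
if $\mathcal{IM}(\sigma, \mathcal{D})$ is not empty [BGKM02].
Furthermore, it is clear that
$|\mathcal{IM}(\sigma, \mathcal{D})| \leq |\mathcal{I}|\,|\mathcal{FM}(\sigma, \mathcal{D})|$
for all minimum frequency thresholds $\sigma \in [0, 1]$.

The collection $\mathcal{IM}(\sigma, \mathcal{D})$ can be obtained
from $\mathcal{FM}(\sigma, \mathcal{D})$ by generating all minimal
hypergraph transversals in the hypergraph

$$\{\mathcal{I} \setminus X : X \in \mathcal{FM}(\sigma, \mathcal{D})\},$$

i.e., in the hypergraph consisting of the complements of the maximal
$\sigma$-frequent itemsets in $\mathcal{D}$ [MT97].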
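The connection to hypergraph transversals can be checked on small
inputs with the brute-force sketch below. It is an illustration only
(exponential in the number of items, and not one of the algorithms of
the cited works); the function names and the two toy inputs are
assumptions. A set of items hits the complement of every maximal
frequent itemset exactly when it is contained in no maximal frequent
itemset, i.e., when it is infrequent.

```python
from itertools import combinations


def minimal_transversals(items, edges):
    """All minimal sets of items that intersect every edge (brute force)."""
    found = []
    # Candidates are enumerated by increasing size, so every proper subset
    # that is itself a transversal contains an already found minimal one.
    for r in range(len(items) + 1):
        for cand in map(frozenset, combinations(sorted(items), r)):
            hits_all = all(cand & edge for edge in edges)
            if hits_all and not any(t < cand for t in found):
                found.append(cand)
    return found


def minimal_infrequent(items, maximal_frequent):
    """Minimal infrequent itemsets from the maximal frequent itemsets."""
    items = frozenset(items)
    edges = [items - frozenset(x) for x in maximal_frequent]
    return minimal_transversals(items, edges)


# Hypothetical inputs: one maximal frequent itemset ABC over items ABCD,
# and two maximal frequent itemsets AB and C over items ABC.
print(minimal_infrequent("ABCD", ["ABC"]))
print(minimal_infrequent("ABC", ["AB", "C"]))
```

If some maximal frequent itemset equals the whole item set, its
complement is an empty edge that nothing can hit, and the function
returns an empty list, in agreement with the fact that no itemset is
then infrequent.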

The slack in the bounds between the number of the maximal
frequent and the number of the minimal infrequent itemsets implies
that it cannot be decided in advance, without seeing the data, which
of the representations, $\mathcal{FM}(\sigma, \mathcal{D})$ or
$\mathcal{IM}(\sigma, \mathcal{D})$, is better. In practice, the
smaller of the collections $\mathcal{FM}(\sigma, \mathcal{D})$ and
$\mathcal{IM}(\sigma, \mathcal{D})$ can be chosen. Each maximal
frequent itemset determines its subitemsets to be frequent and each
minimal infrequent itemset determines its superitemsets to be
infrequent. Sometimes one can obtain a representation for
$\mathcal{F}(\sigma, \mathcal{D})$ that is smaller than
$\mathcal{FM}(\sigma, \mathcal{D})$ or
$\mathcal{IM}(\sigma, \mathcal{D})$ by choosing some itemsets from
$\mathcal{FM}(\sigma, \mathcal{D})$ and some from
$\mathcal{IM}(\sigma, \mathcal{D})$ in such a way that the chosen
itemsets determine the collection $\mathcal{F}(\sigma, \mathcal{D})$
uniquely [Mie04c].

Sometimes it is not sufficient to represent only the collection
of interesting patterns: the quality values of the patterns are
needed as well. For example, the accuracy of an association
rule $X \Rightarrow Y$ depends on the frequencies of the frequent
itemsets $X$ and $X \cup Y$. One solution is to determine the pattern
collection as described above and describe the quality values in the
collection of interesting patterns separately. The quality values can
be represented, e.g., by a simplified database [Mie03c] or by a
random sample of transactions from the database [Mie04c]. In these
approaches, however, the condensed representation is no longer a
subcollection of the patterns. Thus, a different approach is
needed if the condensed representation of the pattern collection is
required to consist of patterns.

2.4.2 Closed and Free Patterns 

For the rest of the chapter we shall focus on interestingness
measures $\phi$ such that $p \preceq p'$ implies
$\phi(p) \geq \phi(p')$ for all $p, p' \in \mathcal{P}$, i.e.,
on anti-monotone interestingness measures. Then maximal interesting
patterns and their quality values determine lower bounds for the
quality values of all other interesting patterns as well. The highest
lower bound obtainable for the quality value of a pattern $p$ from the
quality values of the maximal patterns is

$$\max \{\phi(p') : p \prec p' \in \mathrm{Max}(\mathcal{P}_q, \preceq)\}.$$

The patterns $p$ whose quality value matches the maximum quality
value of the maximal interesting patterns that are superpatterns of
$p$ can be removed from the collection of potentially irredundant
patterns if the maximal interesting patterns are decided to be
irredundant. An exact representation for the collection of
interesting patterns can be obtained by repeating these
operations. The collection of the irredundant patterns obtained by
the previous procedure is called the collection of closed interesting
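patterns |/()98| .

Definition 2.10 (closed patterns). A pattern $p \in \mathcal{P}$ is
closed in the collection $\mathcal{P}$ with respect to the partial
order $\prec$ and the interestingness measure $\phi$ if and only if
$p \prec p'$ implies $\phi(p) > \phi(p')$ for all
$p' \in \mathcal{P}$. The collection of closed patterns in
$\mathcal{P}$ is denoted by $\mathrm{Cl}(\mathcal{P}, \preceq, \phi)$.
For brevity, $\preceq$ and $\phi$ can be omitted when they are
clear from the context.

The collection of closed $\sigma$-frequent itemsets in $\mathcal{D}$
is denoted by $\mathcal{FC}(\sigma, \mathcal{D})$. One procedure for
detecting the closed patterns (for a given pattern collection
$\mathcal{P}$, a partial order $\preceq$ and an interestingness
measure $\phi$) is described as Algorithm 2.3.

Algorithm 2.3 Detection of closed patterns.

Input: A collection $\mathcal{P}$ of patterns, a partial order
  $\preceq$ over $\mathcal{P}$ and an interestingness measure $\phi$.
Output: The collection $\mathrm{Cl}(\mathcal{P}, \preceq, \phi)$ of
  patterns in $\mathcal{P}$ that are closed with respect to $\phi$.

function Closed-Patterns($\mathcal{P}$, $\preceq$, $\phi$)
  $\mathrm{Cl}(\mathcal{P}, \preceq, \phi) \leftarrow \emptyset$
  $\mathcal{K} \leftarrow \mathcal{P}$
  while $\mathcal{K} \neq \emptyset$ do
    $\mathcal{K}' \leftarrow \mathrm{Max}(\mathcal{K}, \preceq)$
    $\mathrm{Cl}(\mathcal{P}, \preceq, \phi) \leftarrow \mathrm{Cl}(\mathcal{P}, \preceq, \phi) \cup \mathcal{K}'$
    $\mathcal{K} \leftarrow \mathcal{K} \setminus \mathcal{K}'$
    $\mathcal{K} \leftarrow \{p \in \mathcal{K} : p' \in \mathcal{K}', p \prec p' \Rightarrow \phi(p) > \phi(p')\}$
  end while
  return $\mathrm{Cl}(\mathcal{P}, \preceq, \phi)$
end function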



Example 2.8 (closed frequent itemsets). Let the transaction
database $\mathcal{D}$ be the same as in Example 2.2, i.e.,

$$\mathcal{D} = \{(1, ABC), (2, AB), (3, ABCD), (4, BC)\}.$$

Then

$$\mathcal{FC}(2/4, \mathcal{D}) = \{B, AB, BC, ABC\} = \mathcal{F}(2/4, \mathcal{D}) \setminus \{\emptyset, A, C, AC\}.$$

□
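The definition of closed patterns can be checked directly on a small
collection. The sketch below is a straightforward quadratic test of
the condition "every proper superpattern in the collection has a
strictly smaller quality value", not an implementation of Algorithm
2.3; the function name and the dictionary encoding of the supports are
assumptions made for illustration.

```python
def closed_patterns(quality):
    """Patterns whose proper superpatterns all have a smaller quality value.

    quality maps each pattern (a frozenset, ordered by set inclusion)
    to its quality value.
    """
    return {p for p, value in quality.items()
            if all(value > other
                   for q, other in quality.items() if p < q)}


# Supports of the itemsets contained in at least two of the four
# transactions ABC, AB, ABCD and BC.
supports = {frozenset(k): v for k, v in {
    "": 4, "A": 3, "B": 4, "C": 3,
    "AB": 3, "AC": 2, "BC": 3, "ABC": 2}.items()}
closed = closed_patterns(supports)
print(sorted("".join(sorted(x)) for x in closed))
```

For example, A is not closed because its superitemset AB has the same
support, whereas AB is closed because its only superitemset ABC in the
collection has a strictly smaller support.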

Example 2.9 (closed 0.20-frequent itemsets in the course
completion database). Let us denote the collection of the closed
$\sigma$-frequent itemsets in $\mathcal{D}$ with cardinality $i$ by
$\mathcal{FC}(\sigma, \mathcal{D})[i]$. Then the cardinality
distributions of the closed 0.20-frequent itemsets in the course
completion database (see Subsection 2.2.1) and the most frequent
closed itemsets of each cardinality are as shown in Table 2.5.

Table 2.5: The number of the closed 0.20-frequent itemsets of each
cardinality in the course completion database, the most frequent
closed itemsets of each cardinality and their supports.

  $i$   $|\mathcal{FC}(\sigma,\mathcal{D})[i]|$   most frequent $X \in \mathcal{FC}(\sigma,\mathcal{D})[i]$   $\mathit{supp}(X, \mathcal{D})$
  0      1     $\emptyset$               2405
  1     34     {0}                       2076
  2    186     {0,1}                     1345
  3    454     {2,3,5}                    960
  4    638     {0,2,3,5}                  849
  5    519     {0,2,3,4,5}                681
  6    238     {0,2,3,5,7,12}             588
  7     58     {2,3,5,7,12,13,15}         547
  8      8     {0,2,3,5,7,12,13,15}       512

□

It is a natural question whether the closed interesting patterns
could be discovered directly without generating all interesting
patterns. For many kinds of frequent closed patterns this question
has been answered positively; there exist methods for mining
directly from data, e.g., closed frequent itemsets [PBTL99, PCT+03,
WHP03, ZH02], closed frequent sequences [WH04, YHA03], and closed
frequent graphs [YH03]. Recently it has been shown that
frequent closed itemsets can be found in time polynomial in the size
of the output [UAUA04].

The number $|\mathrm{Cl}(\mathcal{P}_q)|$ of closed interesting
patterns is at most the number $|\mathcal{P}_q|$ of all interesting
patterns and at least the number $|\mathrm{Max}(\mathcal{P}_q)|$ of
maximal interesting patterns, since
$\mathcal{P}_q \supseteq \mathrm{Cl}(\mathcal{P}_q) \supseteq \mathrm{Max}(\mathcal{P}_q)$.
Tighter bounds for the number of closed interesting patterns depend
on the properties of the pattern collection $\mathcal{P}$.

Example 2.10 (the number of $\sigma$-frequent itemsets versus
the number of closed $\sigma$-frequent itemsets). Similarly to the
maximal frequent itemsets, the number of closed frequent itemsets
in the transaction database $\mathcal{D} = \{(1, \mathcal{I})\}$ is
exponentially smaller than the number of all frequent itemsets for
all minimum frequency thresholds $\sigma \in (0, 1]$. □

However, the number of closed frequent itemsets can be exponentially
larger than the number of maximal frequent itemsets.

Example 2.11 (the number of maximal $\sigma$-frequent itemsets
versus the number of closed $\sigma$-frequent itemsets). Let
$\mathcal{D}$ consist of one transaction for each subset of size
$|\mathcal{I}| - 1$ of $\mathcal{I}$ and
$\lceil \sigma/(1 - \sigma) \rceil\,|\mathcal{I}|$ transactions
consisting of the itemset $\mathcal{I}$. Then
$\mathcal{FM}(\sigma, \mathcal{D}) = \{\mathcal{I}\}$ but
$\mathcal{FC}(\sigma, \mathcal{D}) = \{X \subseteq \mathcal{I}\} = 2^{\mathcal{I}}$. □

Example 2.12 (comparing all, closed and maximal $\sigma$-frequent
itemsets in the course completion database). Let us consider the
course completion database (see Subsection 2.2.1). In that
transaction database, the numbers of all, closed and maximal
$\sigma$-frequent itemsets for several different minimum frequency
thresholds $\sigma$ are as shown in Table 2.6.

Table 2.6: The number of all, closed and maximal $\sigma$-frequent
itemsets in the course completion database for several different
minimum frequency thresholds $\sigma$.

  $\sigma$          all      closed    maximal
  0.50                7           7          3
  0.40               18          18         10
  0.30              103         103         28
  0.25              363         360         80
  0.20             2419        2136        253
  0.15            19585       12399        857
  0.10           208047       82752       4456
  0.05          5214764      918604      43386
  0.04         12785998     1700946      80266
  0.03         38415247     3544444     172170
  0.02        167578070     8486933     414730
  0.01       1715382996    23850242    1157338

The number of maximal $\sigma$-frequent itemsets is quite low compared
even to the number of closed $\sigma$-frequent itemsets. The number
of closed $\sigma$-frequent itemsets is also often considerably smaller
than the number of all $\sigma$-frequent itemsets, especially for low
values of $\sigma$. □

A desirable property of closed frequent itemsets is that they can
be defined by closures of the itemsets. A closure of an itemset $X$
in a transaction database $\mathcal{D}$ is the intersection of the
transactions in $\mathcal{D}$ containing $X$, i.e.,

$$\mathit{cl}(X, \mathcal{D}) = \bigcap_{(i, Y) \in \mathcal{D},\, Y \supseteq X} Y.$$

Clearly, there is a unique closure $\mathit{cl}(X, \mathcal{D})$ in the
transaction database $\mathcal{D}$ for each itemset $X$. It can be
shown that each closed itemset
is its own closure jGWMl |^ry0lj IEBTL99 . Thus, the collection of 
closed $\sigma$-frequent itemsets can be expressed alternatively as

$$\mathcal{FC}(\sigma, \mathcal{D}) = \{X \subseteq \mathcal{I} : \mathit{cl}(X, \mathcal{D}) = X,\ \mathit{fr}(X, \mathcal{D}) \geq \sigma\}.$$

In fact, this is often used as a definition of a closed itemset.
In this dissertation, however, the closed patterns are not defined
using closures; the reason is that for pattern collections other than
frequent itemsets it is not clear whether the closure can be defined
in a natural way and when it is unique.

The levelwise algorithm (Algorithm 2.2) can also be adapted to mine
closed itemsets: Let $\mathcal{FI}(\sigma, \mathcal{D})$ denote the
collection of all $\sigma$-frequent items in $\mathcal{D}$ and let
$\mathcal{FC}_k$ be the collection of the closed frequent itemsets at
level $k$. The level of a closed itemset is the length of the
shortest path from the itemset to the closure of the empty itemset in
the partial order defined by the set inclusion relation. Thus, the
zeroth level consists of the closure of the empty itemset. The
collection of potentially frequent closed itemsets at level $k$
($k \geq 1$) consists of the closures of $X \cup \{A\}$ for each
frequent closed itemset $X$ at level $k - 1$ and each frequent item
$A \notin X$. The adaptation of Algorithm 2.2 for frequent closed
itemset mining is described as Algorithm 2.4.

The collection of closed interesting patterns can be seen as a
refinement of the collection of maximal interesting patterns: a closed
interesting pattern $p$ is a maximal interesting pattern for the
minimum quality value thresholds in the interval

$$\left( \max \{\phi(p') : p \prec p'\},\ \phi(p) \right].$$

Algorithm 2.4 The levelwise algorithm for discovering frequent
closed itemsets in a transaction database.

Input: A transaction database $\mathcal{D}$ and a minimum frequency
  threshold $\sigma \in (0, 1]$.
Output: The collection $\mathcal{FC}(\sigma, \mathcal{D})$ of
  $\sigma$-frequent closed itemsets in $\mathcal{D}$.

function Closures($\sigma$, $\mathcal{D}$)
  $\mathcal{I} \leftarrow \bigcup_{(j, X) \in \mathcal{D}} X$
  $\mathcal{FI} \leftarrow \{A \in \mathcal{I} : \mathit{fr}(A, \mathcal{D}) \geq \sigma\}$
  $i \leftarrow 0$
  $\mathcal{FC}_i \leftarrow \{\mathit{cl}(\emptyset, \mathcal{D})\}$
  $\mathcal{FC}(\sigma, \mathcal{D}) \leftarrow \mathcal{FC}_0$
  repeat
    $i \leftarrow i + 1$
    $\mathcal{K} \leftarrow \{\mathit{cl}(X \cup \{A\}, \mathcal{D}) : X \in \mathcal{FC}_{i-1}, A \in \mathcal{FI} \setminus X\}$
    $\mathcal{FC}_i \leftarrow \{X \in \mathcal{K} : \mathit{fr}(X, \mathcal{D}) \geq \sigma\} \setminus \mathcal{FC}(\sigma, \mathcal{D})$
    $\mathcal{FC}(\sigma, \mathcal{D}) \leftarrow \mathcal{FC}(\sigma, \mathcal{D}) \cup \mathcal{FC}_i$
  until $\mathcal{K} = \emptyset$
  return $\mathcal{FC}(\sigma, \mathcal{D})$
end function
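The closure operator and the levelwise search over closures can be
sketched in Python as follows. This is a minimal illustration in the
spirit of Algorithm 2.4 under stated assumptions: transactions are
given as plain collections of items without identifiers, the database
is non-empty, the threshold is positive, and the loop stops as soon as
a level produces no new closed itemset (which happens no later than
the candidate collection becoming empty). The function names are
illustrative.

```python
def closure(itemset, transactions):
    """Intersection of the transactions that contain itemset.

    Returns None when no transaction contains the itemset.
    """
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else None


def closed_frequent_itemsets(transactions, sigma):
    """Levelwise search over closures for the sigma-frequent closed itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def fr(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = frozenset().union(*transactions)
    frequent_items = {a for a in items if fr(frozenset([a])) >= sigma}
    level = {closure(frozenset(), transactions)}  # level 0: cl(empty set)
    closed = set(level)
    while level:
        candidates = set()
        for x in level:
            for a in frequent_items - x:
                y = x | {a}
                # The closure of y has the same frequency as y itself,
                # so infrequent candidates can be skipped before closing.
                if fr(y) >= sigma:
                    candidates.add(closure(y, transactions))
        level = candidates - closed
        closed |= level
    return closed


database = ["ABC", "AB", "ABCD", "BC"]
result = closed_frequent_itemsets(database, 0.5)
print(sorted("".join(sorted(x)) for x in result))
```

On this database the closure of the empty itemset is B, since B occurs
in every transaction, so the search starts from B rather than from the
empty itemset.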



A natural relaxation of the closed interesting patterns is to store
maximal interesting patterns for several minimum quality value
thresholds. For example, the collections

$$\mathcal{FM}(\sigma, \mathcal{D}),\ \mathcal{FM}(\sigma + \epsilon, \mathcal{D}),\ \ldots,\ \mathcal{FM}(\sigma + (\lceil (1 - \sigma)/\epsilon \rceil - 1)\,\epsilon, \mathcal{D})$$

of the maximal frequent itemsets are sufficient for estimating the
frequency of any $\sigma$-frequent itemset in $\mathcal{D}$ with
maximum absolute error at most $\epsilon$. Furthermore, the
frequencies of the maximal frequent itemsets are not needed: it is
sufficient to know to which of the collections
$\mathcal{FM}(\sigma, \mathcal{D}), \mathcal{FM}(\sigma + \epsilon, \mathcal{D}), \ldots$
each maximal pattern belongs and what the minimum frequency threshold
of that collection is. Then the frequency of an itemset $X$ can be
estimated to be the maximum of the minimum frequency thresholds of the
maximal itemset collections that contain a superitemset of $X$.

Algorithm 2.3 can be modified to solve this task of approximating
the collection of $\sigma$-frequent closed itemsets. To approximate
the collections of frequent itemsets in particular, many maximal
frequent itemset mining techniques can be adapted for mining the
maximal frequent itemset collections for several minimum frequency
thresholds, see e.g. [PDZH02].

An alternative notion of approximating closed frequent itemsets
is proposed in [BB00]. The approach readily generalizes to any
collection of interesting patterns with an anti-monotone
interestingness measure: a pattern is considered to be
$\epsilon$-closed if the absolute difference between its quality value
and the largest quality value of its superpatterns is more than
$\epsilon$.

Finally, an approach based on simplifying interestingness values 
to approximate closed interesting patterns is described in Chapter El 
of this dissertation and another approximation based on pattern 
ordering with respect to the informativeness of the prefixes of the 
ordering is proposed in Chapter 01 

Instead of defining irredundant patterns to be those that have
strictly higher quality values than any of their superpatterns, the
irredundant patterns could be defined to be those that have strictly
lower quality values than any of their subpatterns. The latter
patterns are called free patterns [BBR03], generators [PBTL99] or key
patterns [BTP+00].

Definition 2.11 (free patterns). A pattern $p \in \mathcal{P}$ is
free in the collection $\mathcal{P}$ with respect to the partial order
$\prec$ and the interestingness measure $\phi$ if and only if
$p' \prec p$ implies $\phi(p) < \phi(p')$ for all
$p' \in \mathcal{P}$. The collection of free patterns in $\mathcal{P}$
is denoted by $\mathrm{Gen}(\mathcal{P})$.

The collection of free $\sigma$-frequent itemsets in $\mathcal{D}$ is
denoted by $\mathcal{FG}(\sigma, \mathcal{D})$. Unfortunately, the
free interesting patterns $\mathrm{Gen}(\mathcal{P}_q)$ are not always
a sufficient representation for all interesting patterns: the minimal
free uninteresting patterns, i.e., the patterns in the collection
$\mathrm{Min}(\mathrm{Gen}(\mathcal{P}) \setminus \mathcal{P}_q)$,
are needed as well.

Example 2.13 (free frequent itemsets). Let the transaction
database $\mathcal{D}$ be the same as in Example 2.2, i.e.,

$$\mathcal{D} = \{(1, ABC), (2, AB), (3, ABCD), (4, BC)\}.$$

Then

$$\mathcal{FG}(2/4, \mathcal{D}) = \{\emptyset, A, C, AC\} = \mathcal{F}(2/4, \mathcal{D}) \setminus \{B, AB, BC, ABC\}.$$

This is not, however, sufficient to determine the collection of
2/4-frequent itemsets in $\mathcal{D}$ since there is no information
about $B$ or $D$. The item $B$ is frequent but not free, whereas the
item $D$ is free but not frequent. □
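The definition of free patterns can be tested on the same small
collection in the same way as closedness, with the direction of the
comparison reversed. The sketch below is an illustration only; the
function name and the dictionary of supports (for the itemsets
occurring in at least two of the four transactions) are assumptions.

```python
def free_patterns(quality):
    """Patterns whose proper subpatterns all have a larger quality value.

    quality maps each pattern (a frozenset, ordered by set inclusion)
    to its quality value.
    """
    return {p for p, value in quality.items()
            if all(value < other
                   for q, other in quality.items() if q < p)}


# Supports of the itemsets contained in at least two of the four
# transactions ABC, AB, ABCD and BC.
supports = {frozenset(k): v for k, v in {
    "": 4, "A": 3, "B": 4, "C": 3,
    "AB": 3, "AC": 2, "BC": 3, "ABC": 2}.items()}
free = free_patterns(supports)
print(sorted("".join(sorted(x)) for x in free))
```

The item B is rejected because its support equals that of the empty
itemset, and ABC is rejected because its subitemset AC already has the
same support.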

As in the case of closed interesting patterns, the number of free
interesting patterns is at most the number of all interesting
patterns. The number of free interesting patterns can be smaller than
even the number of maximal interesting or minimal uninteresting
patterns.

In the case of frequent itemsets, the number of free frequent
itemsets is always at least as large as the number of closed frequent
itemsets since each free itemset has only one closure but several
free itemsets can share the same closure. Although the free frequent
itemsets seem to have many disadvantages, they have one major
advantage compared to closed frequent itemsets: collections of free
frequent itemsets are downward closed [BBR03]. Thus, closed
frequent itemsets can be discovered from free frequent itemsets by
computing the closures of all free frequent itemsets. Notice that if
free frequent itemsets are used only to compute the closed frequent
itemsets, the minimal free infrequent itemsets are not needed for
the representation since for each closed frequent itemset $X$ there is
at least one free frequent itemset $Y$ such that
$X = \mathit{cl}(Y, \mathcal{D})$.

Similarly to closed frequent itemsets, mining approximate free
itemset collections based on a few different notions of
approximation has also been studied [BBR03, PDZH02].

2.4.3 Non-Derivable Itemsets 

Taking the maximum or the minimum of the quality values of
the super- or subpatterns is a rather simple method of inferring
the unknown quality values, but much more complex inference
techniques are not useful with arbitrary anti-monotone interestingness
measures. (Note that this is the case even with arbitrary frequent
pattern collections since the only requirement for frequency is
anti-monotonicity.) For some pattern collections with suitable
interestingness measures it is possible to find more concise
representations.

For example, several more sophisticated condensed representations
have been developed for frequent itemsets [BEDH, CG03a, Kry01].
This line of work can be seen to culminate in non-derivable
itemsets [CG02]. The idea of non-derivable itemsets is to
deduce lower and upper bounds for the frequency of the itemset
from the frequencies of its subitemsets.

Definition 2.12 (non-derivable itemsets). Let $\overline{\mathit{fr}}$
and $\underline{\mathit{fr}}$ denote mappings that give upper and
lower bounds for the frequency of any itemset over $\mathcal{I}$. An
itemset $X \subseteq \mathcal{I}$ is non-derivable with respect to the
transaction database $\mathcal{D}$ (and functions
$\overline{\mathit{fr}}$ and $\underline{\mathit{fr}}$) if and only if
the lower bound $\underline{\mathit{fr}}(X, \mathcal{D})$ is strictly
smaller than the upper bound $\overline{\mathit{fr}}(X, \mathcal{D})$.
The collection of non-derivable itemsets is denoted by
$\mathcal{N}(\mathcal{D})$.

One bound for the frequencies can be computed using
inclusion-exclusion [CG02]. (An alternative to inclusion-exclusion
would be to use (integer) linear programming [BSH02, Cal04a]. However,
if the bounds for the frequencies are computed from the frequencies of
all subitemsets, then inclusion-exclusion leads to the best possible
solution [Cal04b].) From the inequality

$$\sum_{Y \subseteq Z \subseteq X} (-1)^{|Z \setminus Y|}\, \mathit{fr}(Z, \mathcal{D}) \geq 0$$

holding for all $X$ and $Y$, it is possible to derive upper and lower
bounds for the frequency of the itemset $X$ in $\mathcal{D}$ [CG03a]:

$$\overline{\mathit{fr}}(X, \mathcal{D}) = \min_{Y \subseteq X} \Big\{ \sum_{Y \subseteq Z \subset X} (-1)^{|X \setminus Z| + 1}\, \mathit{fr}(Z, \mathcal{D}) : |X \setminus Y| \text{ is odd} \Big\}$$

$$\underline{\mathit{fr}}(X, \mathcal{D}) = \max_{Y \subseteq X} \Big\{ \sum_{Y \subseteq Z \subset X} (-1)^{|X \setminus Z| + 1}\, \mathit{fr}(Z, \mathcal{D}) : |X \setminus Y| \text{ is even} \Big\}$$

The collection of non-derivable itemsets is downward closed. The
largest non-derivable itemset is at most of size
$\lfloor \log_2 |\mathcal{D}| \rfloor + 1$ [CG02].
To represent frequent itemsets it is sufficient to store the frequent
non-derivable itemsets and the minimal infrequent non-derivable
itemsets whose upper bounds for the frequency are at least the
minimum frequency threshold.
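The inclusion-exclusion bounds can be evaluated by brute force on a
small database, as in the sketch below. It is an illustration only and
makes several assumptions: absolute supports are used instead of
frequencies (the two differ by the constant factor of the database
size), the trivial bounds 0 and the number of transactions are used as
starting values, the four-transaction database is the one used in the
examples of this chapter, and the non-derivable itemsets are listed
only among the itemsets with support at least two. The function names
are hypothetical.

```python
from itertools import combinations


def powerset(s):
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]


def support_bounds(x, supp, n):
    """Lower and upper bounds for supp(x) from the supports of its
    proper subsets, by inclusion-exclusion."""
    lower, upper = 0, n
    for y in powerset(x):
        # Sum over all z with y <= z < x (z a proper subset of x).
        delta = sum((-1) ** (len(x - z) + 1) * supp[z]
                    for z in powerset(x) if y <= z < x)
        if len(x - y) % 2 == 1:
            upper = min(upper, delta)  # odd difference: upper bound
        else:
            lower = max(lower, delta)  # even difference: lower bound
    return lower, upper


database = [frozenset(t) for t in ["ABC", "AB", "ABCD", "BC"]]
supp = {x: sum(1 for t in database if x <= t) for x in powerset("ABCD")}
n = len(database)
frequent = [x for x in supp if supp[x] >= 2]
non_derivable = {x for x in frequent
                 if support_bounds(x, supp, n)[0]
                 < support_bounds(x, supp, n)[1]}
print(sorted("".join(sorted(x)) for x in non_derivable))
```

For instance, the bounds for AB coincide, so its support can be
derived from the supports of the empty itemset, A and B, whereas the
bounds for AC leave a gap and AC is therefore non-derivable.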

Example 2.14 (non-derivable itemsets). Let the transaction
database $\mathcal{D}$ be the same as in Example 2.2, i.e.,

$$\mathcal{D} = \{(1, ABC), (2, AB), (3, ABCD), (4, BC)\}.$$

Then $\mathcal{N}(\mathcal{D}) = \{\emptyset, A, B, C, AC\}$. □

The approach of non-derivable itemsets is essentially different 
from the other condensed representations described, as no addi- 
tional assumptions are made about the itemsets with unknown 
frequencies: their frequencies can be determined uniquely using, 
e.g., inclusion-exclusion. In contrast, using closed and free item- 
sets, each unknown frequency is assumed to be determined exactly 
as the maximum frequency of its superitemsets and the minimum 
frequency of its subitemsets, respectively. 
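
As a rough illustration of the inclusion-exclusion bounds, the
following Python sketch computes the lower and upper bounds for an
itemset from the frequencies of its proper subitemsets, using the
transactions of Example 2.14. The names `frequency`, `subsets` and
`ie_bounds` are our own illustrative choices, not taken from the
cited works, and the brute-force enumeration of subsets is meant
only for small itemsets.

```python
from fractions import Fraction
from itertools import combinations

# Transactions of Example 2.14 (transaction identifiers omitted).
DATABASE = [frozenset("ABC"), frozenset("AB"),
            frozenset("ABCD"), frozenset("BC")]


def frequency(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for transaction in database if itemset <= transaction)
    return Fraction(hits, len(database))


def subsets(items):
    """Yield all subsets of `items` as frozensets."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def ie_bounds(x, database):
    """Return (lower, upper) inclusion-exclusion bounds for fr(x).

    Only frequencies of proper subsets of x are used. For each Y
    subset of X the signed sum over Y <= Z < X gives an upper bound
    when |X \\ Y| is odd and a lower bound when it is even.
    """
    x = frozenset(x)
    lower, upper = Fraction(0), Fraction(1)
    for y in subsets(x):
        total = Fraction(0)
        for extra in subsets(x - y):
            z = y | extra
            if z == x:
                continue  # the unknown frequency itself is excluded
            sign = (-1) ** (len(x - z) + 1)
            total += sign * frequency(z, database)
        if len(x - y) % 2 == 1:
            upper = min(upper, total)
        else:
            lower = max(lower, total)
    return lower, upper
```

With these transactions, `ie_bounds("AC", DATABASE)` gives the
bounds 1/2 and 3/4, so AC is non-derivable, whereas
`ie_bounds("AB", DATABASE)` gives 3/4 for both bounds, so the
frequency of AB is determined by its subitemsets.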

The problem of finding non-derivable representations for pattern
classes essentially different from itemsets is a very important and
still largely open problem.



2.5 Exploiting Patterns 

The real goal in pattern discovery is rarely just to obtain the pat- 
terns themselves but to use the discovered patterns. 

One indisputable use of patterns is to disclose interesting aspects 
of the data. The suitability of different ways to represent pattern 
collections for the disclosure depends crucially on the actual appli- 
cation and the goals of data mining in the task at hand. However, 
at least the number of patterns and their complexity affect the un- 
derstandability of the collection. 

In practice, the number of patterns in the representation is
strongly affected by the application and the database. For example,
when represented explicitly, the itemset collection consisting only
of the itemset $\mathcal{I}$ is probably easier to understand than
the collection $2^{\mathcal{I}}$ of all subsets of $\mathcal{I}$.
The explicit representation, however, is not always the most
suitable.

Example 2.15 (representing a collection implicitly). Sometimes
the database $\mathcal{D}$ can be expected to be so dense that all
frequent itemsets are also closed, i.e.,
$\mathcal{F}(\sigma, \mathcal{D}) = \mathcal{FC}(\sigma, \mathcal{D})$
under normal circumstances (with respect to the assumptions). If
only the closed frequent itemsets are being represented, then it is
most convenient to describe the collection by its maximal itemsets
and those non-maximal itemsets that are not closed. Thus, it would
be very surprising if the database $\mathcal{D}$ happened to be such
that the only closed itemset were $\mathcal{I}$, and recognizing
exactly that fact from the representation of the collection, i.e.,
the collection $2^{\mathcal{I}} \setminus \{\mathcal{I}\}$, would be
quite arduous. □

Also the complexity of the representation can have a significant
influence on the understandability. For example, the smallest
Turing machine generating the pattern collection is probably quite
an unintuitive representation. (The length of the encoding of such a
Turing machine is called the Kolmogorov complexity or algorithmic
information of the pattern collection [Cal02, LV97].) Similar
situations occur also with the condensed representations. For
example, although the number of non-derivable itemsets is usually
less than the number of free frequent itemsets, the collection of
the free frequent itemsets might still be more understandable since
for most of us choosing the minimum value is a more natural
operation than computing all possible inclusion-exclusion
truncations.

Data mining is an exploratory process to exploit the data. The
data or the patterns derived from the data might not be
understandable as a whole and the right questions to be asked about
the data are not always known in advance. Thus, it would be useful
to be able to answer (approximately) several queries to patterns
and data. (A database capable of supporting data mining by means of
that kind of queries is often called an inductive database
[Bou04, DR03, IM96, Man97].) The three most important aspects of
approximate query answering are the following:

Representation size. The size of the summary structure needed
for answering the queries is very important. In addition to
the actual space required for the storage, the size can affect
also the efficiency of query answering: it is much more
expensive to retrieve patterns from, e.g., tertiary memory than
to do small computations based on patterns in main memory.
For example, if all $\sigma$-frequent itemsets and their
frequencies fit into main memory, then the frequency queries can be
answered very efficiently for $\sigma$-frequent itemsets compared
to computing the frequency by scanning through the complete
transaction database that might reside on an external server
with heavy load. There are many ways to store pattern collections
concisely. For example, representing the pattern collection and
their quality values by listing just the quality values leads to
quite concise representations [Mie05b].

The efficiency of query answering. It is not always known in
advance what should be asked about the data. Also, the pattern
collections can be too large to digest completely in one go. Thus,
different viewpoints to data and patterns might be helpful.
Efficient query answering can be provided by efficient index
structures. For example, although the number of closed frequent
itemsets is often considerably smaller than the number of all
frequent itemsets, retrieving the frequency of a given frequent
itemset can be more difficult. If all frequent itemsets are stored,
then answering the frequency query $fr(X, \mathcal{D})$ can be
implemented as a membership query: the frequencies of the frequent
itemsets can be stored in a trie and thus the frequency of a
frequent itemset $X$ can be found in time linear in $|X|$.
Answering the same query when storing only the closed frequent
itemsets in a trie is much more difficult: in the worst case the
whole trie has to be traversed. This problem can be relieved by
inserting some additional links to the trie. The trie
representations can be generalized to deterministic automata
representations [Mie05a].

The accuracy of the answers. Sometimes approximate answers
to queries are sufficient if they can be provided substantially
faster than the exact answers. Furthermore, it might be too
expensive to store all data (or patterns) and thus exact answers
might be impossible [BBD+02]. A simple approach to answering many
queries quite accurately is to store a random sample of the data.
For example, storing a random subset $\mathcal{D}'$ of the
transactions in the transaction database $\mathcal{D}$ gives good
approximations to frequency queries [Toi96, Mie04c]. Another
alternative is to store some subset of itemsets and estimate
the unknown frequencies from them [KS02, MT96, PMS03].
A natural fusion of these two approaches is to use both patterns
and data to represent the structure facilitating the possible
queries [GGM03]. When the query answers are inaccurate, it is often
valuable to obtain some bounds for the errors. The frequencies of
the frequent itemsets, for example, can be bounded below and above
by, e.g., linear programming and (truncated) inclusion-exclusion
[BSH02, CG02].
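
The trie-based membership query mentioned under the efficiency of
query answering can be sketched as follows. This is a minimal
illustration under our own assumptions: the class names `TrieNode`
and `ItemsetTrie` are hypothetical, items are assumed to be
comparable so that each itemset has a canonical sorted order, and
only explicitly inserted itemsets have a stored frequency.

```python
class TrieNode:
    """One node of the trie; `frequency` is set only for stored itemsets."""

    def __init__(self):
        self.children = {}
        self.frequency = None


class ItemsetTrie:
    """Stores itemsets along paths of items taken in sorted order."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, itemset, frequency):
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.frequency = frequency

    def lookup(self, itemset):
        """Return the stored frequency, or None if the itemset is absent.

        After sorting, the walk visits one node per item, so the
        number of steps is linear in the size of the itemset.
        """
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return None
        return node.frequency
```

A lookup that returns `None` tells only that the itemset was not
stored; when the trie holds all frequent itemsets, this means that
the itemset is infrequent.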



CHAPTER 3 



Frequency-Based Views to
Pattern Collections



It is a highly non-trivial task to define an (anti-monotone)
interestingness measure $\phi$ such that there is a minimum quality
value threshold $\sigma$ capturing almost all truly interesting and
only few uninteresting patterns in the collection. One way to
augment the interestingness measure is to define additional
constraints for the patterns. The use of constraints is a very
important research topic in pattern discovery but the research has
concentrated mostly on structural constraints on patterns and
pattern collections [BGMP03, BJAG00, DRJLM02, GVdB00, KGBW03,
LLN03, Mie03c, SVA97]. Typical examples of structural constraints
for patterns are constraints for items and itemsets: an interesting
itemset can be required or forbidden to contain certain items or
itemsets. Other typical constraints for pattern collections are
monotone and anti-monotone constraints such as minimum and maximum
frequency thresholds, or minimum and maximum cardinality
constraints for the itemsets.

Example 3.1 (constraints in itemset mining). Let the set
$\mathcal{I}$ of items be products sold in a grocery store. The
transaction database $\mathcal{D}$ could then consist of
transactions corresponding to purchases of customers that have
bought something from the shop at least three times. As a
constrained itemset mining task, we could be interested in finding
itemsets that

1. do not contain garlic,

2. consist of at least seven products,

3. contain at least two vegetables or bread and sour milk, and

4. cost at most ten euros.

These constraints attempt to characterize global travelers that are
likely to become low-profit regular customers.

The first and the third constraint are examples of constraints
for items or itemsets. The second and the fourth constraints are
examples of monotone and anti-monotone constraints, respectively.

Clearly, all constraints could be expressed as boolean
combinations of item constraints, since that is sufficient for
defining any subcollection of $2^{\mathcal{I}}$ and all constraints
define a subcollection of $2^{\mathcal{I}}$. However, that would not
be very intuitive and also it could be computationally very
demanding to find all satisfying truth assignments (corresponding
to itemsets) for an arbitrary boolean formula. □

In this chapter we propose a complementary approach to further
restrict and sharpen the collection of interesting patterns. The
approach is based on simplifying the quality values of the patterns
and it can be seen as a natural generalization of characterizing the
interesting patterns by a minimum quality value threshold $\sigma$
for the quality values of the patterns. The quality value
simplifications can be adapted easily to pattern classes of various
kinds since they depend only on the quality values of the
interesting patterns and not on the structural properties of the
patterns. Simplifying the quality values is suitable for
interactive pattern discovery as post-processing of a pattern
collection containing the potentially interesting patterns. For
example, in the case of itemsets, the collection of potentially
interesting patterns usually consists of the $\sigma$-frequent
itemsets for the smallest possible minimum frequency threshold
$\sigma$ such that the frequent itemset mining is still feasible in
practice.

In addition to making the collection more understandable in
general, the simplifications of the quality values can be used to
reduce the number of interesting patterns by discretizing the
quality values and removing the patterns whose discretized quality
values can be inferred (approximately) from the quality values of
the patterns that are not removed. Although there might be more
powerful ways to condense the collection of interesting patterns,
the great virtue of discretization is its conceptual simplicity: it
is relatively easy to understand how the discretization simplifies
the structure of the quality values in the collection of
interesting patterns.

This chapter is based on the article "Frequency-Based Views to
Pattern Collections" [Mie03d]. For brevity, we consider for the rest
of the chapter frequencies instead of arbitrary quality values.

3.1 Frequency-Based Views

A simplification of frequencies is a mapping $\psi : [0,1] \to I$,
where $I$ is a collection of non-overlapping intervals covering the
interval $[0,1]$, i.e.,

$$I \subseteq \{ [a,b], [a,b), (a,b], (a,b) \subseteq [0,1] \}$$

such that $\bigcup I = [0,1]$ and $i \cap j = \emptyset$ for all
distinct $i, j \in I$.

Example 3.2 (frequent patterns). The collection
$\mathcal{F}(\sigma, \mathcal{D})$ of $\sigma$-frequent patterns can
be defined using frequency simplifications as follows:

$$\psi(fr(p, \mathcal{D})) = \begin{cases} fr(p, \mathcal{D}) & \text{if } fr(p, \mathcal{D}) \geq \sigma \text{ and} \\ [0, \sigma) & \text{otherwise.} \end{cases}$$

□

There are several immediate applications of frequency
simplifications. They can be used, for example, to focus on some
particular frequency-based property of the pattern class.

Example 3.3 (focusing on some frequencies). First, Example 3.2
is an example of focusing on some frequencies.

As a second example, the data analyst might be interested only
in very frequent (e.g., the frequency is at least $1 - \epsilon$)
and very infrequent (e.g., the frequency is at most $\epsilon$)
patterns. Then the patterns in the interval $(\epsilon, 1 - \epsilon)$
could be neglected or their frequencies could all be mapped to the
interval $(\epsilon, 1 - \epsilon)$. Thus, the corresponding
frequency simplification is the mapping

$$\psi(fr(p, \mathcal{D})) = \begin{cases} (\epsilon, 1 - \epsilon) & \text{if } fr(p, \mathcal{D}) \in (\epsilon, 1 - \epsilon) \text{ and} \\ fr(p, \mathcal{D}) & \text{otherwise.} \end{cases}$$

As a third example, let us consider association rules. The data
analyst might be interested in the rules with accuracy close to 1/2
(e.g., within some positive constant $\epsilon$), i.e., the
association rules $p \Rightarrow p'$ ($p, p' \in \mathcal{P}$,
$p \preceq p'$) with no predictive power. Thus, in that case the
frequency simplification $\psi(acc(p'', \mathcal{D}))$ of the
association rule $p \Rightarrow p'$ (denoted by a shorthand $p''$)
is

$$\psi(acc(p'', \mathcal{D})) = \begin{cases} [0, 1/2 - \epsilon) & \text{if } acc(p'', \mathcal{D}) < 1/2 - \epsilon, \\ (1/2 + \epsilon, 1] & \text{if } acc(p'', \mathcal{D}) > 1/2 + \epsilon \text{ and} \\ acc(p'', \mathcal{D}) & \text{otherwise.} \end{cases}$$

□
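
The simplifications of Examples 3.2 and 3.3 can be sketched in
Python as follows. This is only an illustration under our own
assumptions: the `Interval` type and the function names are
hypothetical, and the choice of mapping infrequent patterns to the
interval $[0, \sigma)$ in `simplify_frequent` is our reading of
Example 3.2.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A subinterval of [0, 1] with open or closed end points."""
    low: float
    high: float
    closed_low: bool
    closed_high: bool


def simplify_frequent(freq, sigma):
    """Example 3.2: frequencies below sigma collapse to [0, sigma)."""
    if freq >= sigma:
        return freq
    return Interval(0.0, sigma, True, False)


def simplify_extremes(freq, eps):
    """Example 3.3, second case: the middle range collapses to (eps, 1 - eps)."""
    if eps < freq < 1 - eps:
        return Interval(eps, 1 - eps, False, False)
    return freq


def simplify_accuracy(acc, eps):
    """Example 3.3, third case: only accuracies near 1/2 are kept exactly."""
    if acc < 0.5 - eps:
        return Interval(0.0, 0.5 - eps, True, False)
    if acc > 0.5 + eps:
        return Interval(0.5 + eps, 1.0, False, True)
    return acc
```

Representing an exactly kept value as a plain number and a
collapsed value as an `Interval` keeps the two cases easy to tell
apart in later processing; a uniform representation with degenerate
intervals would work equally well.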

Frequency simplifications are useful also in condensing collec- 
tions of frequent patterns. For an example of this, see Section ESI 
Other potential applications are speeding up the pattern discovery 
algorithms, hiding confidential information about the data from the 
pattern users, correcting or indicating errors in data and in frequent 
patterns, and examining the stability of the collection of frequent 
patterns. 

Although the frequency simplifications in general may require a 
considerable amount of interaction, defining simple mappings from 
the unit interval [0, 1] to a collection of its subintervals and applying 
the simplification in pattern discovery is often more tractable than 
defining complex structural constraints with respect to definability 
and computational complexity. Here are some examples of simple 
mappings: 

• Points in a subinterval of [0, 1] can be replaced by the subin- 
terval itself. 

• The points can be discretized by a given discretization func- 
tion. 

• Affine transformations, logarithms and other mappings can 
be applied to the points. 

Note that the simplification does not have to be applicable to all
points in [0, 1] but only to the finite number of different
frequencies $fr(p, \mathcal{D})$ of the patterns at hand.

The frequency simplifications clearly have certain limitations, as
they focus just on frequencies, neglecting the structural aspects of
the patterns and the pattern collection (although the structure of
the pattern collection can be taken into account indirectly when
defining the simplification). For example, sometimes interesting
and uninteresting patterns can have the same frequency.
Nevertheless, the frequency simplifications can be useful in
constrained pattern discovery as a complementary approach to
structural constraints. Furthermore, the simplifications could be
used to aid in the search for advantageous constraints by revealing
properties that cannot be expressed by the frequencies.

3.2 Discretizing Frequencies

Discretization is an important special case of simplifying
frequencies. In general, discretizations are used especially for two
purposes: reducing noise and decreasing the size of the
representation. As an example of these, let us look at k-means
clusterings.

Example 3.4 (k-means clustering). The k-means clustering of
a (finite) point set $P \subset \mathbb{R}^d$ tries to find a set
$O$ of $k$ points in $\mathbb{R}^d$ that minimize the cost

$$\sum_{p \in P} \min_{o \in O} \sum_{i=1}^{d} (p_i - o_i)^2.$$

This objective can be interpreted as trying to find the centers of
$k$ Gaussian distributions that would be the most likely to generate
the point set $P$. Thus, each point in $P$ can be considered as a
cluster center plus some Gaussian noise.

The representation of the set $P$ by the cluster centers is clearly
smaller than the original point set $P$. Furthermore, if the centers
of the Gaussian distributions are far enough from each other, then
the points in $P$ can be encoded in smaller space by expressing for
each point $p \in P$ the cluster $o \in O$ where it belongs and the
vector $p - o$.
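
The cost and the encoding by cluster and residual vector can be
illustrated with a few lines of Python. The function names below
are our own, and the sketch takes the cluster centers as given; it
does not implement any particular clustering algorithm.

```python
def squared_distance(p, o):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - oi) ** 2 for pi, oi in zip(p, o))


def kmeans_cost(points, centers):
    """Sum over the points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, o) for o in centers) for p in points)


def encode(points, centers):
    """Express each point as (index of nearest center, residual vector p - o)."""
    encoded = []
    for p in points:
        index = min(range(len(centers)),
                    key=lambda j: squared_distance(p, centers[j]))
        residual = tuple(pi - oi for pi, oi in zip(p, centers[index]))
        encoded.append((index, residual))
    return encoded


def decode(encoded, centers):
    """Recover the original points from the centers and the residuals."""
    return [tuple(oi + ri for oi, ri in zip(centers[index], residual))
            for index, residual in encoded]
```

When the clusters are tight, the residual vectors have small
entries and can be stored with fewer bits than the original
coordinates, which is the space saving referred to above.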

Note that in practice, the k-means clusterings are not always the
correct ones, even if the assumption of $k$ Gaussian distributions
generating the set $P$ is true, because the standard algorithm used
for k-means clustering (known as the k-means algorithm) is a greedy
heuristic. Furthermore, even if it were known to which cluster each
of the points in $P$ belongs, the points in each cluster rarely
provide the correct estimate for the cluster center. (For more details

□



A discretization of frequencies can be defined as follows:

Definition 3.1 (discretization of frequencies). A discretization
of frequencies is a mapping $\gamma$ from $[0, 1]$ to a (finite)
subset of $[0, 1]$ that preserves the order of the points. That is,
if $x, y \in [0, 1]$ and $x \leq y$ then $\gamma(x) \leq \gamma(y)$.
Points in the range of the discretization function $\gamma$ are
called the discretization points of $\gamma$.

Example 3.5 (discretization of frequencies). Probably the
simplest example of discretization functions is the mapping $\gamma$
that maps all frequencies in $[0, 1]$ to some constant
$c \in [0, 1]$. Clearly, such $\gamma$ is a mapping from $[0, 1]$ to
a finite subset of $[0, 1]$ and
$x \leq y \Rightarrow \gamma(x) \leq \gamma(y)$ for all
$x, y \in [0, 1]$. □

One often very important requirement for a good discretization 
function is that it should not introduce much error, i.e., the dis- 
cretized values should not differ too much from the original values. 
In the next subsections we prove data-independent bounds for the 
errors in accuracies of association rules with respect to certain dis- 
cretization functions of frequencies and give algorithms to minimize 
the empirical loss of several loss functions. 

To simplify the considerations, the frequencies of the patterns 
are assumed to be strictly positive for the rest of the chapter. 

3.2.1 Loss Functions for Discretization

The loss functions considered in this section are absolute error and
approximation ratio.

The absolute error for a point $x \in (0, 1]$ with respect to a
discretization function $\gamma$ is

$$\ell_a(x, \gamma) = |x - \gamma(x)|$$

and the maximum absolute error with respect to a discretization
function $\gamma$ for a finite set $P \subset (0, 1]$ of points is

$$\ell_a(P, \gamma) = \max_{x \in P} \ell_a(x, \gamma). \qquad (3.1)$$

In addition to the absolute error, also the relative error, i.e.,
the approximation ratio, is often used to evaluate the goodness of
the approximation. The approximation ratio for a point
$x \in (0, 1]$ is

$$\ell_r(x, \gamma) = \frac{\gamma(x)}{x}$$

and the maximum approximation ratio interval with respect to a
discretization function $\gamma$ for a finite set $P \subset (0, 1]$
is

$$\ell_r(P, \gamma) = \left[ \min_{x \in P} \ell_r(x, \gamma), \max_{x \in P} \ell_r(x, \gamma) \right]. \qquad (3.2)$$

Let $\ell(x, \gamma)$ denote the loss for a point $x \in P$ with
respect to a given discretization $\gamma$. Sometimes the most
appropriate error for a point set is not the maximum error
$\max_{x \in P} \ell(x, \gamma)$ but a weighted sum of the errors of
the points in $P$. If the weight function is
$w : P \to \mathbb{R}$ then the weighted sum of errors is

$$\ell_w(P, \gamma) = \sum_{x \in P} w(x) \ell(x, \gamma). \qquad (3.3)$$

In the next few subsections we derive efficient algorithms for
minimizing these loss functions defined by Equation 3.1,
Equation 3.2 and Equation 3.3.
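
The three loss functions can be written down directly in Python.
The sketch below is illustrative only: the function names are our
own, a discretization is represented as an ordinary Python callable,
and the approximation ratio is taken as the discretized value
divided by the original value, which is the direction used by the
relative error bounds later in this section.

```python
from fractions import Fraction


def absolute_error(x, gamma):
    """Absolute error |x - gamma(x)| of a single point."""
    return abs(x - gamma(x))


def max_absolute_error(points, gamma):
    """Maximum absolute error over a finite point set (Equation 3.1)."""
    return max(absolute_error(x, gamma) for x in points)


def approximation_ratio(x, gamma):
    """Approximation ratio gamma(x) / x of a single point, x > 0."""
    return gamma(x) / x


def approximation_ratio_interval(points, gamma):
    """Smallest and largest approximation ratio over the set (Equation 3.2)."""
    ratios = [approximation_ratio(x, gamma) for x in points]
    return min(ratios), max(ratios)


def weighted_loss(points, gamma, weight, loss):
    """Weighted sum of pointwise losses (Equation 3.3)."""
    return sum(weight(x) * loss(x, gamma) for x in points)


# The constant discretization of Example 3.5 with c = 1/2.
def constant_half(x):
    return Fraction(1, 2)
```

For instance, for the points 1/10, 3/10, 7/10 and 9/10 the constant
discretization to 1/2 has maximum absolute error 2/5, and with unit
weights the weighted sum of absolute errors is 6/5.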



3.2.2 Data-Independent Discretization

In this subsection we show that the discretization functions

$$\gamma_a^\epsilon(x) = \epsilon + 2\epsilon \left\lfloor \frac{x}{2\epsilon} \right\rfloor \qquad (3.4)$$

and

$$\gamma_r^\epsilon(x) = (1 - \epsilon)^{1 + 2 \lfloor (\ln x) / (2 \ln(1 - \epsilon)) \rfloor} \qquad (3.5)$$

are the worst case optimal discretization functions with respect to
the maximum absolute error and the maximum approximation ratio
interval, respectively. Furthermore, we bound the maximum absolute
error and the intervals for approximation ratios for the accuracies
of association rules computed using the discretized frequencies.

Let us first study the optimality of the discretization functions.
The discretization function $\gamma_a^\epsilon$ is optimal in the
following sense:

Theorem 3.1. Let $P \subset (0, 1]$ be a finite set. Then

$$\ell_a(P, \gamma_a^\epsilon) \leq \epsilon.$$

Furthermore, for any other data-independent discretization function
$\gamma$ with fewer discretization points,
$\ell_a(P', \gamma) > \epsilon$ for some point set
$P' \subset (0, 1]$ such that $|P| = |P'|$.

Proof. For any point $x \in (0, 1]$, the absolute error
$\ell_a(x, \gamma_a^\epsilon)$ with respect to the discretization
function $\gamma_a^\epsilon$ is at most $\epsilon$ since

$$2\epsilon \left\lfloor \frac{x}{2\epsilon} \right\rfloor \leq x < 2\epsilon + 2\epsilon \left\lfloor \frac{x}{2\epsilon} \right\rfloor$$

and

$$\gamma_a^\epsilon(x) = \epsilon + 2\epsilon \left\lfloor \frac{x}{2\epsilon} \right\rfloor.$$

Any discretization function $\gamma$ can be considered as a
collection $\gamma^{-1}$ of intervals covering the interval
$(0, 1]$. Each discretization point can cover an interval of length
at most $2\epsilon$ when the maximum absolute error is allowed to be
at most $\epsilon$. Thus, at least $\lceil 1/(2\epsilon) \rceil$
discretization points are needed to cover the whole interval
$(0, 1]$. The discretization function $\gamma_a^\epsilon$ uses
exactly that number of discretization points. □

It can be observed from the proof of Theorem 3.1 that some
maximum error bounds $\epsilon$ are unnecessarily high. Thus, the
maximum absolute error bound $\epsilon$ can be decreased without
increasing the number of discretization points.

Corollary 3.1. The bound $\epsilon$ for the maximum absolute error
can be decreased to

$$\frac{1}{2 \lceil 1/(2\epsilon) \rceil}$$

without increasing the number of discretization points when
discretizing by the function $\gamma_a^\epsilon$.

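The worst case optimality of the discretization function
$\gamma_r^\epsilon$ can be shown as follows:

Theorem 3.2. Let $P \subset (0, 1]$ be a finite set. Then

$$\ell_r(P, \gamma_r^\epsilon) \subseteq \left[ 1 - \epsilon, \frac{1}{1 - \epsilon} \right].$$

Furthermore, for any other data-independent discretization function
$\gamma$ with fewer discretization points we have

$$\ell_r(P', \gamma) \not\subseteq \left[ 1 - \epsilon, \frac{1}{1 - \epsilon} \right]$$

for some point set $P' \subset (0, 1]$ such that $|P| = |P'|$.

Proof. Clearly,

$$\left\lfloor \frac{\ln x}{2 \ln(1 - \epsilon)} \right\rfloor \leq \frac{\ln x}{2 \ln(1 - \epsilon)} < 1 + \left\lfloor \frac{\ln x}{2 \ln(1 - \epsilon)} \right\rfloor$$

holds for all $x > 0$ and we can write

$$(1 - \epsilon)^{(\ln x) / \ln(1 - \epsilon)} = (1 - \epsilon)^{2 (\ln x) / (2 \ln(1 - \epsilon))} = x.$$

Thus,

$$(1 - \epsilon) x = (1 - \epsilon)^{1 + 2 (\ln x) / (2 \ln(1 - \epsilon))} \leq (1 - \epsilon)^{1 + 2 \lfloor (\ln x) / (2 \ln(1 - \epsilon)) \rfloor} = (1 - \epsilon)^{-1 + 2 + 2 \lfloor (\ln x) / (2 \ln(1 - \epsilon)) \rfloor} < (1 - \epsilon)^{-1 + 2 (\ln x) / (2 \ln(1 - \epsilon))} = \frac{x}{1 - \epsilon}.$$

The discretization function $\gamma_r^\epsilon$ is worst case
optimal for any interval $[x, 1] \subset (0, 1]$, since it defines a
partition of $[x, 1]$ with maximally long intervals. □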
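
The guarantees of Theorems 3.1 and 3.2 are easy to check
numerically. The following Python sketch implements the two
data-independent discretization functions as we read Equations 3.4
and 3.5; the names `gamma_abs` and `gamma_rel` are our own, and
because floating-point arithmetic is used, the bounds hold only up
to a small rounding tolerance.

```python
import math


def gamma_abs(x, eps):
    """Equation 3.4: eps + 2 * eps * floor(x / (2 * eps)).

    Every x is mapped to the midpoint of the length-2*eps interval
    [2*eps*k, 2*eps*(k + 1)) that contains it.
    """
    return eps + 2 * eps * math.floor(x / (2 * eps))


def gamma_rel(x, eps):
    """Equation 3.5: (1 - eps) ** (1 + 2 * floor(ln x / (2 ln(1 - eps)))).

    Requires 0 < x <= 1 and 0 < eps < 1. The discretization points
    are the odd powers of (1 - eps).
    """
    k = math.floor(math.log(x) / (2 * math.log(1 - eps)))
    return (1 - eps) ** (1 + 2 * k)
```

On a grid of frequencies in (0, 1), the absolute error of
`gamma_abs` stays within `eps` and the ratio `gamma_rel(x, eps) / x`
stays within the interval from `1 - eps` to `1 / (1 - eps)`, up to
rounding.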

Furthermore, discretization functions with maximum absolute error
and maximum relative error guarantees give guarantees for the
maximum relative and the maximum absolute errors, respectively, as
follows.

Theorem 3.3. A discretization function with the maximum absolute
error $\epsilon$ guarantees that a discretization of a point
$x \in (0, 1]$ has the relative error in the interval
$[1 - \epsilon/x, 1 + \epsilon/x]$.

Proof. By definition, the minimum and the maximum discretization
errors of a discretization function with the maximum absolute error
at most $\epsilon$ are $(x - \epsilon)/x = 1 - \epsilon/x$ and
$(x + \epsilon)/x = 1 + \epsilon/x$. □

Theorem 3.4. A discretization function with the maximum relative
error in the interval $[1 - \epsilon, 1 + \epsilon]$ guarantees that
a point $x \in (0, 1]$ has the maximum absolute error at most
$\epsilon x$.

Proof. The discretization $\gamma(x)$ of $x$ with the maximum
relative error in the interval $[1 - \epsilon, 1 + \epsilon]$ is in
the interval $[(1 - \epsilon) x, (1 + \epsilon) x]$. Thus, the
maximum absolute error is

$$\max \{ x - (1 - \epsilon) x, (1 + \epsilon) x - x \} = \epsilon x$$

as claimed. □

An important use of frequent patterns is to discover accurate
association rules. Thus, it would be very useful to be able to bound
the errors for the accuracies of the association rules. For
simplicity, we consider association rules over itemsets although all
following results hold for any pattern collections and quality
values.

Let us first study how well the maximum absolute error guarantees
for frequency discretizations transfer to the maximum absolute
error guarantees for the accuracies of association rules.

Theorem 3.5. Let $\gamma^\epsilon$ be a discretization function with
the maximum absolute error $\epsilon$. The maximum absolute error
for the accuracy of the association rule $X \Rightarrow Y$ when the
frequencies $fr(X \cup Y, \mathcal{D})$ and $fr(X, \mathcal{D})$ are
discretized by $\gamma^\epsilon$ is at most

$$\min \left\{ 1, \frac{2\epsilon}{fr(X, \mathcal{D})} \right\}.$$

Proof. By definition, a discretization function preserves the order
of points in the discretizations. Because
$fr(X \cup Y, \mathcal{D}) \leq fr(X, \mathcal{D})$, we have
$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) \leq \gamma^\epsilon(fr(X, \mathcal{D}))$.

Since the correct accuracies are always in the interval $[0, 1]$,
the maximum absolute error is at most 1.

The two extreme cases are

1. when $fr(X \cup Y, \mathcal{D}) = fr(X, \mathcal{D}) - \delta \geq 0$
for arbitrary small $\delta > 0$, but
$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = fr(X \cup Y, \mathcal{D}) - \epsilon$
and
$\gamma^\epsilon(fr(X, \mathcal{D})) = fr(X \cup Y, \mathcal{D}) + \epsilon$,
and

2. when $fr(X \cup Y, \mathcal{D}) = fr(X, \mathcal{D}) - 2\epsilon + \delta \geq 0$
for arbitrary small $\delta > 0$, but
$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = \gamma^\epsilon(fr(X, \mathcal{D}))$.

In the first case, the worst case absolute error is at most

$$\left| \frac{fr(X \cup Y, \mathcal{D}) - \epsilon}{fr(X \cup Y, \mathcal{D}) + \epsilon} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| \leq \left| \frac{fr(X \cup Y, \mathcal{D}) - \epsilon}{fr(X, \mathcal{D}) + \epsilon} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| = \frac{\epsilon fr(X, \mathcal{D}) + \epsilon fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})^2 + \epsilon fr(X, \mathcal{D})} \leq \frac{2 \epsilon fr(X, \mathcal{D})}{fr(X, \mathcal{D})^2 + \epsilon fr(X, \mathcal{D})} = \frac{2\epsilon}{fr(X, \mathcal{D}) + \epsilon}.$$

In the second case, the absolute error in the worst case is at most

$$1 - \frac{fr(X, \mathcal{D}) - 2\epsilon}{fr(X, \mathcal{D})} = \frac{2\epsilon}{fr(X, \mathcal{D})}$$

when $fr(X, \mathcal{D}) \geq 2\epsilon$.

Thus, the second case is larger and gives the upper bound. □

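Note that in the worst case the maximum absolute error can
indeed be 1 as shown by Example 3.6.

Example 3.6 (the tightness of the bound for $\gamma_a^\epsilon$ and
any $\epsilon > 0$). Let $fr(X \cup Y, \mathcal{D}) = \delta$ and
$fr(X, \mathcal{D}) = 2\epsilon - \delta$. Then
$\gamma_a^\epsilon(fr(X \cup Y, \mathcal{D})) = \gamma_a^\epsilon(fr(X, \mathcal{D})) = \epsilon$.
Thus,

$$\left| \frac{\gamma_a^\epsilon(fr(X \cup Y, \mathcal{D}))}{\gamma_a^\epsilon(fr(X, \mathcal{D}))} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| = 1 - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} = 1 - \frac{\delta}{2\epsilon - \delta} \to 1$$

when $\delta \to 0$. □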
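
The tightness example can be replayed with exact rational
arithmetic. In the Python sketch below the helper names are our own
and `gamma_abs` follows our reading of Equation 3.4; the values of
`eps` and `delta` used in any call are arbitrary illustrative
choices.

```python
from fractions import Fraction


def gamma_abs(x, eps):
    """Equation 3.4 with exact arithmetic: eps + 2*eps*floor(x / (2*eps))."""
    return eps + 2 * eps * (x // (2 * eps))


def accuracy_error(fr_xy, fr_x, eps):
    """Absolute error of the rule accuracy computed from discretized frequencies."""
    true_accuracy = fr_xy / fr_x
    discretized_accuracy = gamma_abs(fr_xy, eps) / gamma_abs(fr_x, eps)
    return abs(discretized_accuracy - true_accuracy)
```

Calling `accuracy_error(delta, 2 * eps - delta, eps)` with
`Fraction` arguments gives exactly `1 - delta / (2 * eps - delta)`,
which approaches 1 as `delta` becomes small: both frequencies are
discretized to `eps`, so the discretized accuracy is 1 although the
true accuracy is close to 0.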



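When the maximum absolute error for the frequency discretization
function is bounded, also the maximum relative error for the
accuracies of the association rules computed from discretized and
original frequencies can be bounded as follows:

Theorem 3.6. Let $\gamma^\epsilon$ be a discretization function with
the maximum absolute error $\epsilon$. The approximation ratio for
the accuracy of the association rule $X \Rightarrow Y$, when the
frequencies $fr(X \cup Y, \mathcal{D})$ and $fr(X, \mathcal{D})$ are
discretized using the function $\gamma^\epsilon$, is in the interval

$$\left[ \max \left\{ 0, \frac{fr(X \cup Y, \mathcal{D}) - \epsilon}{fr(X \cup Y, \mathcal{D}) + \epsilon} \right\}, \frac{fr(X, \mathcal{D})}{fr(X \cup Y, \mathcal{D})} \right].$$

Proof. The smallest approximation ratio is obtained when
$fr(X, \mathcal{D}) = fr(X \cup Y, \mathcal{D}) + \delta$ where
$\delta$ is an arbitrary small positive value, but
$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = fr(X \cup Y, \mathcal{D}) - \epsilon$
and $\gamma^\epsilon(fr(X, \mathcal{D})) = fr(X, \mathcal{D}) + \epsilon$.
Then the approximation ratio is

$$\frac{fr(X \cup Y, \mathcal{D}) - \epsilon}{fr(X, \mathcal{D}) + \epsilon} \left( \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right)^{-1} = \frac{fr(X \cup Y, \mathcal{D}) fr(X, \mathcal{D}) - \epsilon fr(X, \mathcal{D})}{fr(X \cup Y, \mathcal{D}) fr(X, \mathcal{D}) + \epsilon fr(X \cup Y, \mathcal{D})} = \frac{fr(X \cup Y, \mathcal{D})^2 + \delta fr(X \cup Y, \mathcal{D}) - \epsilon fr(X \cup Y, \mathcal{D}) - \delta \epsilon}{fr(X \cup Y, \mathcal{D})^2 + \delta fr(X \cup Y, \mathcal{D}) + \epsilon fr(X \cup Y, \mathcal{D})}.$$

If $\delta \to 0$, then

$$\frac{fr(X \cup Y, \mathcal{D}) - \epsilon}{fr(X, \mathcal{D}) + \epsilon} \left( \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right)^{-1} \to \frac{fr(X \cup Y, \mathcal{D}) - \epsilon}{fr(X \cup Y, \mathcal{D}) + \epsilon}.$$

Note that in that inequality, we assume that
$fr(X, \mathcal{D}) \geq fr(X \cup Y, \mathcal{D}) \geq \epsilon$
because, by Definition 3.1, all discretized values are non-negative.
Hence, we get the claimed lower bound.

By the definition of the approximation ratio, the upper bound
is obtained when $fr(X, \mathcal{D}) \neq fr(X \cup Y, \mathcal{D})$
but $\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = \gamma^\epsilon(fr(X, \mathcal{D}))$.
The greatest approximation ratio is obtained when
$fr(X \cup Y, \mathcal{D}) = fr(X, \mathcal{D}) - 2\epsilon + \delta$
for arbitrary small $\delta > 0$ but
$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = \gamma^\epsilon(fr(X, \mathcal{D}))$.
Then the approximation ratio is

$$\frac{\gamma^\epsilon(fr(X \cup Y, \mathcal{D}))}{\gamma^\epsilon(fr(X, \mathcal{D}))} \left( \frac{fr(X, \mathcal{D}) - 2\epsilon + \delta}{fr(X, \mathcal{D})} \right)^{-1} \to \frac{fr(X, \mathcal{D})}{fr(X, \mathcal{D}) - 2\epsilon}$$

when $\delta \to 0$. If $fr(X, \mathcal{D}) \to 2\epsilon$, then the
ratio increases unboundedly. □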

The worst case relative error bounds for the discretization
function $\gamma_a^\epsilon$ are the following.

Example 3.7 (the worst case relative error bounds of
$\gamma_a^\epsilon$). The smallest ratio is achieved when
$fr(X, \mathcal{D}) = 2k\epsilon$ and
$fr(X \cup Y, \mathcal{D}) = 2k\epsilon - \delta$ for arbitrary
small $\delta > 0$ and some
$k \in \{1, \ldots, \lfloor 1/(2\epsilon) \rfloor\}$. The ratio

$$\frac{(2k - 1)\epsilon / ((2k + 1)\epsilon)}{(2k\epsilon - \delta) / (2k\epsilon)}$$

is minimized by choosing $k = 1$. Thus, the lower bound for the
relative error is 1/3.

The relative error cannot be bounded above since the frequencies
$fr(X, \mathcal{D}) = 2\epsilon - \delta$ and
$fr(X \cup Y, \mathcal{D}) = \delta$ with discretizations
$\gamma(fr(X, \mathcal{D})) = \gamma(fr(X \cup Y, \mathcal{D}))$
give the ratio

$$\frac{1}{\delta / (2\epsilon - \delta)} = \frac{2\epsilon - \delta}{\delta} \to \infty$$

when $\delta \to 0$ and $\epsilon > 0$. □
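
Both parts of the example can be checked with exact arithmetic. The
following Python sketch uses our own helper names, and `gamma_abs`
again follows our reading of Equation 3.4.

```python
from fractions import Fraction


def gamma_abs(x, eps):
    """Equation 3.4 with exact arithmetic: eps + 2*eps*floor(x / (2*eps))."""
    return eps + 2 * eps * (x // (2 * eps))


def accuracy_ratio(fr_xy, fr_x, eps):
    """Discretized accuracy divided by the true accuracy of the rule."""
    true_accuracy = fr_xy / fr_x
    discretized_accuracy = gamma_abs(fr_xy, eps) / gamma_abs(fr_x, eps)
    return discretized_accuracy / true_accuracy


def ratio_at_boundary(k, eps, delta):
    """Ratio when fr(X) = 2*k*eps and fr(X u Y) lies delta below it."""
    fr_x = 2 * k * eps
    return accuracy_ratio(fr_x - delta, fr_x, eps)
```

For `k = 1` the two frequencies fall on different sides of the
discretization boundary `2 * eps`, so the discretized accuracy is
`eps / (3 * eps) = 1/3` while the true accuracy is close to 1. In
the other direction, `accuracy_ratio(delta, 2 * eps - delta, eps)`
equals `(2 * eps - delta) / delta`, which grows without bound as
`delta` decreases.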

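The relative error for the accuracies of the association rules can
be bounded much better when discretizing by the discretization
function $\gamma_r^\epsilon$ having the approximation ratio
guarantees instead of the maximum absolute error guarantees.

Theorem 3.7. Let $\gamma^\epsilon$ be a discretization function with
the approximation ratio in the interval
$[(1 - \epsilon), (1 - \epsilon)^{-1}]$. The approximation ratio for
the accuracy of the association rule $X \Rightarrow Y$ when the
frequencies $fr(X \cup Y, \mathcal{D})$ and $fr(X, \mathcal{D})$ are
discretized by $\gamma^\epsilon$ is in the interval

$$\left[ (1 - \epsilon)^2, (1 - \epsilon)^{-2} \right].$$

Proof. By choosing

$$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = (1 - \epsilon) fr(X \cup Y, \mathcal{D})$$

and

$$\gamma^\epsilon(fr(X, \mathcal{D})) = (1 - \epsilon)^{-1} fr(X, \mathcal{D})$$

we get

$$\frac{(1 - \epsilon) fr(X \cup Y, \mathcal{D})}{(1 - \epsilon)^{-1} fr(X, \mathcal{D})} = (1 - \epsilon)^2 \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}.$$

By choosing

$$\gamma^\epsilon(fr(X \cup Y, \mathcal{D})) = (1 - \epsilon)^{-1} fr(X \cup Y, \mathcal{D})$$

and

$$\gamma^\epsilon(fr(X, \mathcal{D})) = (1 - \epsilon) fr(X, \mathcal{D})$$

we get

$$\frac{(1 - \epsilon)^{-1} fr(X \cup Y, \mathcal{D})}{(1 - \epsilon) fr(X, \mathcal{D})} = (1 - \epsilon)^{-2} \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}.$$

It is easy to see that these are the worst case instances. □

Note that these bounds are tight also for the discretization
function $\gamma_r^\epsilon$.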
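
As a quick numerical sanity check of the interval in Theorem 3.7,
the sketch below evaluates the accuracy ratio for the discretization
function of Equation 3.5 over a grid of frequency pairs. The function
names are our own, and floating-point rounding means that the
bounds are met only up to a small tolerance.

```python
import math


def gamma_rel(x, eps):
    """Equation 3.5: (1 - eps) ** (1 + 2 * floor(ln x / (2 ln(1 - eps))))."""
    k = math.floor(math.log(x) / (2 * math.log(1 - eps)))
    return (1 - eps) ** (1 + 2 * k)


def accuracy_ratio(fr_xy, fr_x, eps):
    """Discretized accuracy divided by the true accuracy of the rule."""
    true_accuracy = fr_xy / fr_x
    discretized_accuracy = gamma_rel(fr_xy, eps) / gamma_rel(fr_x, eps)
    return discretized_accuracy / true_accuracy


def ratio_range(eps, steps=50):
    """Smallest and largest accuracy ratio over pairs fr_xy <= fr_x on a grid."""
    ratios = [accuracy_ratio(i / steps, j / steps, eps)
              for i in range(1, steps + 1)
              for j in range(i, steps + 1)]
    return min(ratios), max(ratios)
```

For any `eps` strictly between 0 and 1, the pair returned by
`ratio_range(eps)` should lie within `(1 - eps) ** 2` and
`(1 - eps) ** -2`, in line with the theorem.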

The discretization functions with the approximation ratio
guarantees give also some guarantees for the maximum absolute
errors of accuracies.

Theorem 3.8. Let $\gamma^\epsilon$ be a discretization function with
the approximation ratio in the interval
$[(1 - \epsilon), (1 - \epsilon)^{-1}]$. Then the maximum absolute
error for the accuracy of the association rule $X \Rightarrow Y$
when the frequencies $fr(X \cup Y, \mathcal{D})$ and
$fr(X, \mathcal{D})$ are discretized by $\gamma^\epsilon$ is at most

$$1 - (1 - \epsilon)^2 = 2\epsilon - \epsilon^2.$$

Proof. There are two extreme cases. First, the frequencies
$fr(X, \mathcal{D})$ and $fr(X \cup Y, \mathcal{D})$ can be almost
equal but be discretized as far as possible from each other, i.e.,

$$\left| \frac{(1 - \epsilon) fr(X \cup Y, \mathcal{D})}{(1 - \epsilon)^{-1} fr(X, \mathcal{D})} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| = \left| (1 - \epsilon)^2 \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| = \left( 1 - (1 - \epsilon)^2 \right) \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}.$$

The maximum value is achieved by setting
$fr(X \cup Y, \mathcal{D}) = fr(X, \mathcal{D}) - \delta$ for
arbitrary small $\delta > 0$.

In the second case, the frequencies $fr(X, \mathcal{D})$ and
$fr(X \cup Y, \mathcal{D})$ are discretized to have the same value
although they are as far apart from each other as possible. That is,

$$\left| \frac{(1 - \epsilon)^{-1} fr(X \cup Y, \mathcal{D})}{(1 - \epsilon) fr(X, \mathcal{D})} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| = \left| \frac{fr(X \cup Y, \mathcal{D})}{(1 - \epsilon)^2 fr(X, \mathcal{D})} - \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} \right| = \left( \frac{1}{(1 - \epsilon)^2} - 1 \right) \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})} = \frac{1 - (1 - \epsilon)^2}{(1 - \epsilon)^2} \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}.$$

However, in that case
$fr(X \cup Y, \mathcal{D}) \leq (1 - \epsilon)^2 fr(X, \mathcal{D})$.
Thus, the maximum absolute error is again at most
$1 - (1 - \epsilon)^2$. □

In this subsection we have seen that data-independent
discretization of frequencies with approximation guarantees can
provide approximation guarantees also for the accuracies of the
association rules computed from the discretized frequencies without
any a priori information about the frequencies (especially when the
frequencies are discretized using a discretization function with
the approximation ratio guarantees).

3.2.3 Data-Dependent Discretization 

In practice, taking the actual data into account usually improves 
the performance of the approximation methods. Thus, it is natu- 
ral to consider also data-dependent discretization techniques. The 
problem of discretizing frequencies by taking the actual frequen- 
cies into account can be formulated as a computational problem as 
follows: 

Problem 3.1 (frequency discretization). Given a finite subset 
P of (0, 1], a maximum error threshold e and a loss function i, find 
a discretization 7 for P such that |7(P)| is minimized and the error 
£(P, 7) is at most e. 

Example 3.8 (frequency discretization). Let the set P ⊆ (0, 1] consist of the points 1/10, 3/10, 7/10 and 9/10, let the maximum error threshold ε be 1/10, and let the loss function ℓ be the maximum absolute error (Equation 3.1).

Then the discretization function γ with the smallest number of discretization points and maximum absolute error at most ε is the mapping

$$\gamma = \{ 1/10 \mapsto 2/10,\; 3/10 \mapsto 2/10,\; 7/10 \mapsto 8/10,\; 9/10 \mapsto 8/10 \}.$$

If ε = 1/9 instead, then there are several mappings with the maximum absolute error at most ε and two discretization points. Namely, all mappings

$$\gamma = \{ 1/10 \mapsto a,\; 3/10 \mapsto a,\; 7/10 \mapsto b,\; 9/10 \mapsto b \}$$

where a ∈ [1/5 − 1/90, 1/5 + 1/90] and b ∈ [4/5 − 1/90, 4/5 + 1/90].
□

In this subsection we derive sub-quadratic algorithms for discretizing with respect to the maximum absolute error, and polynomial-time solutions for many other classes of loss functions.

Maximum absolute error 

A discretization of a point set P ⊆ (0, 1] without exceeding the maximum absolute error ε can be interpreted as an interval cover of the point set P with intervals of length 2ε, i.e., a collection of length-2ε sub-intervals of [0, 1] that together cover all points in P.

A simple solution for the frequency discretization problem with the loss function being the maximum absolute error is to repeatedly choose the minimum uncovered point d ∈ P and discretize all the previously uncovered points of P in the interval [d, d + 2ε] to the value d + ε. This is described as Algorithm 3.1.

Algorithm 3.1 A straightforward algorithm for discretization with respect to the maximum absolute error.
Input: A finite set P ⊆ [0, 1] and a real value ε ∈ [0, 1].
Output: A discretization function γ with ℓ_a(P, γ) ≤ ε.
function Interval-Cover(P, ε)
    while P ≠ ∅ do
        d ← min P
        I ← {x ∈ P : d ≤ x ≤ d + 2ε}
        for all x ∈ I do
            γ(x) ← d + ε
        end for
        P ← P \ I
    end while
    return γ
end function
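The greedy procedure of Algorithm 3.1 can be sketched in Python as follows. This is only an illustration: the function name `interval_cover` and the dictionary representation of γ are assumptions made here, and exact rational arithmetic (`fractions.Fraction`) is used to avoid floating-point rounding at interval boundaries.

```python
from fractions import Fraction

def interval_cover(points, eps):
    """Greedy cover in the spirit of Algorithm 3.1.

    Returns a dict mapping each point to its discretized value.
    """
    remaining = set(points)
    gamma = {}
    while remaining:
        d = min(remaining)                       # smallest uncovered point
        covered = {x for x in remaining if x <= d + 2 * eps}
        for x in covered:
            gamma[x] = d + eps                   # centre of [d, d + 2*eps]
        remaining -= covered                     # P <- P \ I
    return gamma

# The point set of Example 3.8 with eps = 1/10.
P = [Fraction(1, 10), Fraction(3, 10), Fraction(7, 10), Fraction(9, 10)]
gamma = interval_cover(P, Fraction(1, 10))
```

Each pass scans all remaining points, which is the source of the quadratic worst case discussed next.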



Theorem 3.9. Algorithm 3.1 finds a discretization function γ such that the error ℓ_a(P, γ) is at most ε and for all discretizations γ' with a smaller number of discretization points than |γ(P)| the error ℓ_a(P, γ') is greater than ε.

Proof. The maximum absolute error is at most ε since all points are covered by intervals of length 2ε and the distance to the center of any covering interval is at most ε.

To see that a smaller number of discretization points would have a larger error, let x_1, ..., x_m be the discretization points of the discretization γ found by Algorithm 3.1 for the point set P. By construction, there is a point x_i − ε ∈ P for each 1 ≤ i ≤ m. Furthermore, |x_i − x_j| > 2ε for all discretization points x_i and x_j of γ such that 1 ≤ i < j ≤ m, since otherwise the point x_j − ε ∈ P is contained in the interval [x_i − ε, x_i + ε] or the point x_i − ε ∈ P is contained in the interval [x_j − ε, x_j + ε]. Thus, no two points x_i − ε, x_j − ε ∈ P such that 1 ≤ i < j ≤ m can share the same discretization point x_k where 1 ≤ k ≤ m. □

The straightforward implementation of Algorithm 3.1 runs in time O(|P|^2). The bound is tight in the worst case, as shown by Example 3.9.

Example 3.9 (the worst case running time of Algorithm 3.1). Let P = {1/|P|, 2/|P|, ..., 1 − 1/|P|, 1} and ε < 1/(2|P|). Then at each iteration only one point is removed but all other points are inspected. There are |P| iterations and the iteration i takes time O(i). Thus, the total time complexity is O(|P|^2). □

In the special case of ε being a constant, the time complexity of the algorithm is linear in |P| because each iteration takes at most time O(|P|) and there can be at most a constant number of iterations: at each iteration, except possibly the last one, a subinterval of [0, 1] of length at least 2ε is covered. Thus, the number of iterations can be bounded above by ⌈1/(2ε)⌉ = O(1) and the total time needed is O(|P|).

The worst case time complexity of the algorithm can be reduced to O(|P| log |P|) by constructing a heap for the point set P. A minimum element in the heap can be found in constant time and insertions and deletions to the heap can be done in time logarithmic in |P| [Knu98].

The time complexity O(|P| log |P|) is not optimal, especially if preprocessing of the point set P is allowed. For example, if the set P is represented as a sorted array, i.e., an array P such that P[i] ≤ P[j] for all 1 ≤ i < j ≤ |P|, then the problem can be solved in linear time in |P| by Algorithm 3.2.

Algorithm 3.2 A linear-time algorithm for discretizing a sorted point set with respect to the maximum absolute error.
Input: A finite set P ⊆ [0, 1] as an array in ascending order and a real value ε ∈ [0, 1].
Output: A discretization function γ with ℓ_a(P, γ) ≤ ε.
function Prefix-Cover(P, ε)
    d ← −∞    ▷ no discretization point has been chosen yet
    for i = 1, ..., |P| do
        if d < P[i] − ε then
            d ← P[i] + ε
        end if
        γ(P[i]) ← d
    end for
    return γ
end function

The efficiency of Algorithm 3.2 depends crucially on the efficiency of sorting. In the worst case, sorting real-valued points takes time O(|P| log |P|), but sometimes, for example when the points are almost in order, the points can be sorted faster. For example, the frequent itemset mining algorithm Apriori [AMS+96] finds the frequent itemsets in partially descending order of their frequencies. Note also that the generalization of the algorithm Apriori, the levelwise algorithm (Algorithm 2.2), can easily be implemented in such a way that it outputs the frequent patterns in descending order of frequency.
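The single pass over a sorted array can be sketched in Python as follows. The name `prefix_cover` and the use of `None` as the "no point chosen yet" sentinel are assumptions of this sketch, not part of the original pseudocode.

```python
from fractions import Fraction

def prefix_cover(sorted_points, eps):
    """One pass over points given in ascending order (cf. Algorithm 3.2)."""
    gamma = {}
    d = None                          # current discretization point, none yet
    for x in sorted_points:
        if d is None or d < x - eps:  # x is farther than eps from d
            d = x + eps               # open a new interval [x, x + 2*eps]
        gamma[x] = d
    return gamma

P = [Fraction(1, 10), Fraction(3, 10), Fraction(7, 10), Fraction(9, 10)]
gamma = prefix_cover(P, Fraction(1, 10))
```

If the input arrives in descending order, reversing it first keeps the total work linear.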

However, it is possible to find in time O(|P|) a discretization function with maximum absolute error at most ε and the minimum number of discretization points, even if the points in P are not ordered in any specific way in advance. This can be done by first discretizing the frequencies using the discretization function γ_a^ε (Equation 3.4) and then repairing the discretization. The high-level idea of the algorithm is as follows:

1. Put the points in P into bins 0, 1, ..., ⌊1/(2ε)⌋ corresponding to the intervals (0, 2ε], (2ε, 4ε], ..., (2ε⌊1/(2ε)⌋, 1]. Let B be the set of bins such that B[i] corresponds to bin i.

2. Find a minimal non-empty bin i in B. (A non-empty bin i is called minimal if i = 0 or the bin i − 1 is empty.)

3. Find the smallest point x in the bin i, replace the interval corresponding to the bin i by the interval [x, x + 2ε] and move the points of the bin i + 1 that are in the interval [x, x + 2ε] into the bin i.

4. Remove bin i from B.

5. Go to step 2 if there are still non-empty bins.

The algorithm can be implemented to run in linear time in |P|: the discretization to bins can be computed in time O(|P|) using a hash table for the set B [Knu98]. A minimal non-empty bin can be found in amortized constant time by processing the consecutive runs of non-empty bins consecutively.
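The binning idea above can be sketched in Python with a hash table of bins. This is a hedged illustration rather than a transcription of the thesis algorithm: the function name `bin_cover` is hypothetical, bins are taken as half-open intervals [2εi, 2ε(i + 1)) via `math.floor` (a slightly different convention from the intervals in step 1), ε is assumed positive, and the linear running time is an expected bound that relies on hashing.

```python
import math
from collections import defaultdict
from fractions import Fraction

def bin_cover(points, eps):
    """Greedy interval cover for unsorted points using bins of width 2*eps."""
    width = 2 * eps
    bins = defaultdict(list)
    for x in points:
        bins[math.floor(x / width)].append(x)
    gamma, done = {}, set()
    for start in list(bins):
        if start in done:
            continue
        i = start
        while (i - 1) in bins:        # walk down to the minimal bin of this run
            i -= 1
        d = None                      # left end of the most recent interval
        while i in bins:              # sweep the run of non-empty bins upwards
            rest = []
            for x in bins[i]:
                if d is not None and x <= d + width:
                    gamma[x] = d + eps        # covered by the previous interval
                else:
                    rest.append(x)
            if rest:
                d = min(rest)                 # open a new interval [d, d + width]
                for x in rest:
                    gamma[x] = d + eps
            done.add(i)
            i += 1
    return gamma

P = [Fraction(9, 10), Fraction(1, 10), Fraction(7, 10), Fraction(3, 10)]
gamma = bin_cover(P, Fraction(1, 10))
```

Every run of consecutive non-empty bins is swept once, so each point is inspected a constant number of times.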

If the points in P are given in an arbitrary order, then Algorithm 3.3 is asymptotically optimal for minimizing the number of discretization points with respect to the given maximum absolute discretization error threshold ε, as shown by Theorem 3.10.

Theorem 3.10. No (deterministic) algorithm can find a discretization γ with the minimum number of discretization points without inspecting all points in P ⊆ (0, 1] when 2ε + δ < 1 for any δ > 0.

Proof. Let P consist of points in the interval (0, δ) and possibly the point 1. Furthermore, let the points examined by the algorithm be in the interval (0, δ). Based on that information, the algorithm cannot decide for sure whether or not the point 1 is in P. □

If the set P is given in ascending or descending order, however, then it is possible to find a set γ(P) of discretization points of minimum cardinality among those that determine a discretization of P with the maximum absolute error at most ε, in time O(|γ(P)| log |P|); see Algorithm 3.4. Although γ(P) is only an implicit representation of the discretization function γ : P → γ(P), the discretization of any x ∈ P can be found in time Θ(log |γ(P)|) if the set γ(P) is represented, e.g., as a sorted array.
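A minimal Python sketch of this idea uses the standard `bisect` module for the binary searches. The names `log_cover` and `lookup` are hypothetical. The lookup relies on the fact that consecutive discretization points produced by the greedy procedure are more than 2ε apart.

```python
from bisect import bisect_left, bisect_right
from fractions import Fraction

def log_cover(sorted_points, eps):
    """Return the sorted discretization points for an ascending array."""
    centres = []
    i, n = 0, len(sorted_points)
    while i < n:
        d = sorted_points[i] + eps
        centres.append(d)
        # jump to the first point lying beyond the interval [d - eps, d + eps]
        i = bisect_right(sorted_points, d + eps, lo=i + 1)
    return centres

def lookup(x, centres, eps):
    """Discretized value of a point x of the original set."""
    return centres[bisect_left(centres, x - eps)]

P = [Fraction(1, 10), Fraction(3, 10), Fraction(7, 10), Fraction(9, 10)]
eps = Fraction(1, 10)
centres = log_cover(P, eps)
```

Only one binary search is performed per discretization point, so the points between two jumps are never touched.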

Note that the proposed techniques for discretizing with respect to the maximum absolute error guarantees (i.e., Algorithms 3.1, 3.2, 3.3 and 3.4) generalize to maximum error functions that are strictly increasing transformations of the maximum absolute error function. Furthermore, the algorithms can be modified to minimize the maximum absolute error instead of the number of discretization points by a simple application of binary search.

Algorithm 3.3 A linear-time algorithm for discretization with respect to the maximum absolute error.
Input: A finite set P ⊆ [0, 1] and a real value ε ∈ [0, 1].
Output: A discretization function γ with ℓ_a(P, γ) ≤ ε.
function Bin-Cover(P, ε)
    for all x ∈ P do
        i ← ⌊x/(2ε)⌋
        B[i] ← B[i] ∪ {x}
    end for
    for all B[i] ∈ B, B[i] ≠ ∅ do
        while i > 0 and B[i − 1] ≠ ∅ do
            i ← i − 1
            d ← min B[i]
        end while
        while B[i] ≠ ∅ do
            I ← {x ∈ B[i] : d ≤ x ≤ d + 2ε}
            for all x ∈ I do
                γ(x) ← d + ε
            end for
            B[i] ← B[i] \ I
            if min B[i + 1] ≤ d + 2ε then
                i ← i + 1
            else
                d ← min B[i]
            end if
        end while
    end for
    return γ
end function
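The remark that binary search can be used to minimize the maximum absolute error for a fixed number of discretization points can be illustrated as follows. This sketch is an assumption about how such a search might look, not the author's procedure: it enumerates all half-distances between pairs of points as candidate error values (quadratically many, so it is not tuned for speed) and searches for the smallest candidate that a greedy cover can realize with at most k points. The names `cover_size` and `min_max_error` are hypothetical.

```python
from fractions import Fraction

def cover_size(sorted_points, eps):
    """Number of length-2*eps intervals used by the greedy cover."""
    count, limit = 0, None
    for x in sorted_points:
        if limit is None or x > limit:
            count += 1
            limit = x + 2 * eps
    return count

def min_max_error(sorted_points, k):
    """Smallest maximum absolute error achievable with at most k points.

    Assumes a non-empty ascending input and k >= 1.
    """
    n = len(sorted_points)
    candidates = sorted({(sorted_points[j] - sorted_points[i]) / 2
                         for i in range(n) for j in range(i, n)})
    lo, hi = 0, len(candidates) - 1       # the largest candidate always works
    while lo < hi:
        mid = (lo + hi) // 2
        if cover_size(sorted_points, candidates[mid]) <= k:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]

P = [Fraction(n, 10) for n in (1, 2, 5, 6, 9, 10)]
```

The search is valid because the number of intervals needed never increases when ε grows.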



Algorithm 3.4 A sublinear-time algorithm for discretizing a sorted point set with respect to the maximum absolute error.
Input: A finite set P ⊆ [0, 1] as an array in ascending order and a real value ε ∈ [0, 1].
Output: The discretization points γ(P) with ℓ_a(P, γ) ≤ ε.
function Log-Cover(P, ε)
    i ← 1
    while i ≤ |P| do
        d ← P[i] + ε
        γ(P) ← γ(P) ∪ {d}
        j ← |P| + 1
        while j > i + 1 do
            k ← ⌊(i + j)/2⌋
            if P[k] ≤ d + ε then
                i ← k
            else
                j ← k
            end if
        end while
        i ← j
    end while
    return γ(P)
end function

Weighted sums of errors

Sometimes it would be more natural to evaluate the quality of discretizations using a weighted sum

$$\sum_{x \in P} w(x)\, \ell(x, \gamma)$$

of errors ℓ(x, γ) instead of the maximum error max_{x ∈ P} ℓ(x, γ). In that case, the algorithms described previously in this chapter do not find the optimal solutions. Fortunately, the problem can be solved optimally in time polynomial in |P| by dynamic programming.

To describe the solution, we first have to define some notation. Let the point set P be represented as an array in ascending order, i.e., P[i] ≤ P[j] for all 1 ≤ i < j ≤ |P|, and let P[i, j] denote the subarray P[i] ... P[j]. The best discretization point to represent the array P[i, j] is denoted by μ_{i,j} and its error by ε_{i,j}. The loss of the best discretization of P[1, i] with k discretization points with respect to the sum of errors is denoted by Δ_i^k, and the index of the last point assigned to the (k − 1)th discretization point in that discretization is denoted by ω_i^k.

The optimal error for P[1, i] using k discretization points can be defined by the following recursive formula:

$$
\Delta_i^k =
\begin{cases}
\epsilon_{1,i} & \text{if } k = 1 \text{ and} \\
\min_{k \le j \le i} \left\{ \Delta_{j-1}^{k-1} + \epsilon_{j,i} \right\} & \text{otherwise.}
\end{cases}
$$

The optimal sum-of-errors discretization by dynamic programming can be divided into two subtasks:

1. Compute the matrices μ of discretization points and ε of their errors: μ_{i,j} is the discretization point for the subset P[i, j] and ε_{i,j} is its error.

2. Find the optimal discretizations for P[1, i] with k discretization points for all 1 ≤ k ≤ i ≤ |P| from the matrices μ and ε using dynamic programming.

The optimal discretization function for P can be found from any matrix ε ∈ ℝ^{|P|×|P|} of errors and any matrix μ ∈ ℝ^{|P|×|P|} of discretization points (although not all matrices ε and μ make sense, nor are they all computable). For example, the matrices can be given by an expert.

Simple examples of error and discretization point matrices computable in polynomial time in |P| are the matrices ε and μ for the weighted sums of absolute errors. They can be computed in time O(|P|^3) as described by Algorithm 3.5. (Function Median computes the weighted median of P[i, j].)

The discretization points μ_{i,j} and the errors ε_{i,j} of P[i, j] for all 1 ≤ i ≤ j ≤ |P| can already be informative summaries of the set P. Besides that, it is possible to extract from the matrices ε and μ the matrices Δ and ω corresponding to the partial sums of errors and the discretizations. This can be done by Algorithm 3.6. (The matrices Δ and ω determine the optimal discretizations for each number of discretization points and each prefix P[1, i] of P.)
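The whole pipeline (block errors, tabulation, extraction) can be sketched compactly in Python for the weighted sum of absolute errors. This is an illustrative sketch under stated assumptions, not the thesis code: the names `weighted_median` and `best_discretization` are hypothetical, weights are assumed positive, 1 ≤ k ≤ |P| is assumed, indices are 0-based, and the tables are indexed by prefix length rather than by the last index.

```python
from fractions import Fraction

def weighted_median(xs, ws):
    """Smallest x in sorted xs whose cumulative weight reaches half the total."""
    half, acc = Fraction(sum(ws), 2), 0
    for x, w in zip(xs, ws):
        acc += w
        if acc >= half:
            return x

def best_discretization(points, weights, k):
    """points: ascending list. Returns (loss, [(block, discretization point)])."""
    n = len(points)
    mu = [[None] * n for _ in range(n)]    # best point for points[i..j]
    err = [[0] * n for _ in range(n)]      # its weighted absolute error
    for i in range(n):
        for j in range(i, n):
            xs, ws = points[i:j + 1], weights[i:j + 1]
            mu[i][j] = weighted_median(xs, ws)
            err[i][j] = sum(w * abs(x - mu[i][j]) for x, w in zip(xs, ws))
    inf = float("inf")
    # delta[c][i]: best loss for the prefix points[:i] with exactly c points
    delta = [[inf] * (n + 1) for _ in range(k + 1)]
    prev = [[0] * (n + 1) for _ in range(k + 1)]
    delta[0][0] = 0
    for c in range(1, k + 1):
        for i in range(c, n + 1):
            for j in range(c, i + 1):      # the last block is points[j-1:i]
                cand = delta[c - 1][j - 1] + err[j - 1][i - 1]
                if cand < delta[c][i]:
                    delta[c][i], prev[c][i] = cand, j - 1
    blocks, i = [], n
    for c in range(k, 0, -1):              # walk the stored split positions back
        j = prev[c][i]
        blocks.append((points[j:i], mu[j][i - 1]))
        i = j
    return delta[k][n], blocks[::-1]

P = [Fraction(n, 10) for n in (1, 2, 5, 6, 9, 10)]
loss, blocks = best_discretization(P, [1] * len(P), 3)
```

Recomputing each block error from scratch keeps the sketch short; the cubic cost of the table filling dominates in any case.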

The time complexity of Algorithm 3.6 is O(|P|^3). The time consumption can be reduced to O(k|P|^2) if we are interested only in discretizations with at most k discretization points. Furthermore, the method can be adapted to other kinds of loss functions, too. For some loss functions, the dynamic programming can be implemented with asymptotically better efficiency guarantees [EK01, JKM+98]. There are several ways to speed up the search in practice. For example, it is not necessary to compute the parts of the matrices that are detected not to be needed in the best solutions.

Algorithm 3.5 An algorithm to compute the loss and discretization matrices ε and μ for the point set P and a weight function w.
Input: A finite set P ⊆ [0, 1] and a weight function w : P → ℝ.
Output: Matrices ε and μ.
function Valuate-Abs(P, w)
    for i = 1, ..., |P| do
        for j = i, ..., |P| do
            μ_{i,j} ← Median(P[i, j], w)
            ε_{i,j} ← 0
            for k = i, ..., j do
                ε_{i,j} ← ε_{i,j} + w(P[k]) |P[k] − μ_{i,j}|
            end for
        end for
    end for
    return (ε, μ)
end function



Although the matrices Δ and ω contain the information about the optimal discretizations of all prefixes of P for each number of discretization points, usually the actual goal is to extract the optimal discretizations from these matrices.

The optimal discretizations of k discretization points can be found in time O(|P|) by Algorithm 3.7. It can be adapted to find a discretization with the minimum number of discretization points and the error less than ε in time linear in |P|. Note that if it is sufficient to obtain just the set γ(P) of k discretization points, then the task can be conducted in time O(k) by Algorithm 3.8.

Instead of finding the best discretization with a certain num- 
ber of discretization points, one could search for a hierarchical 
discretization suggesting a good discretization of k discretization 
points for all values of k. 

Algorithm 3.6 An algorithm to compute matrices Δ and ω from P, ε and μ.
Input: A finite set P ⊆ [0, 1], and matrices ε and μ.
Output: Matrices Δ and ω.
function Tabulator(P, ε, μ)
    for all i ∈ {1, ..., |P|} do    ▷ Initialize the errors Δ_i^1.
        Δ_i^1 ← ε_{1,i}
        ω_i^1 ← 0
    end for
    for all k, i ∈ {2, ..., |P|}, k ≤ i do
        Δ_i^k ← ∞
    end for
    for k = 2, ..., |P| do    ▷ Find the best discretization of P[1, i] with k discretization points.
        for all j, i ∈ {k, ..., |P|}, j ≤ i do
            if Δ_{j−1}^{k−1} + ε_{j,i} < Δ_i^k then
                Δ_i^k ← Δ_{j−1}^{k−1} + ε_{j,i}
                ω_i^k ← j − 1
            end if
        end for
    end for
    return (Δ, ω)
end function

Algorithm 3.7 An algorithm to extract the best discretization of k discretization points from the matrices Δ and ω.
Input: A finite set P ⊆ [0, 1], matrices Δ, μ and ω, and an integer k ∈ {1, ..., |P|}.
Output: The discretization γ of k discretization points with the smallest error Δ_{|P|}^k.
function Find-Discretization(P, μ, ω, k)
    i ← |P|
    for l = k, ..., 1 do
        for j = i, ..., ω_i^l + 1 do
            γ(P[j]) ← μ_{ω_i^l + 1, i}
        end for
        i ← ω_i^l
    end for
    return γ
end function

Algorithm 3.8 An algorithm to extract the best k discretization points from the matrices Δ and ω.
Input: A finite set P ⊆ [0, 1], matrices Δ, μ and ω, and an integer k ∈ {1, ..., |P|}.
Output: The set γ(P) of k discretization points with the smallest error Δ_{|P|}^k.
function Find-Discretization-Points(P, μ, ω, k)
    γ(P) ← ∅
    i ← |P|
    for l = k, ..., 1 do
        j ← ω_i^l + 1
        γ(P) ← γ(P) ∪ {μ_{j,i}}
        i ← ω_i^l
    end for
    return γ(P)
end function

Example 3.10 (hierarchical discretizations). Let the point set P be {0.1, 0.2, 0.5, 0.6, 0.9, 1.0} and let us consider hierarchical discretizations with respect to the maximum absolute error. Two standard approaches to define hierarchical clusterings are divisive (or top-down) and agglomerative (or bottom-up) clusterings.

Divisive hierarchical clustering starts from the whole point set and recursively divides it in such a way that the division always improves the solution as much as possible. For example, the divisive clustering of P would be the following:

• The first level of the clustering consists of only one cluster, namely {0.1, 0.2, 0.5, 0.6, 0.9, 1.0}.

• The maximum absolute error is decreased as much as possible by splitting the set into two parts {0.1, 0.2, 0.5} and {0.6, 0.9, 1.0}.

• In the third level no split improves the maximum absolute error. However, splitting {0.1, 0.2, 0.5} to {0.1, 0.2} and {0.5}, or splitting {0.6, 0.9, 1.0} to {0.6} and {0.9, 1.0}, decreases most the maximum absolute error for one of the clusters with the maximum absolute error.

• The fourth level consists of the clusters {0.1, 0.2}, {0.5}, {0.6}, and {0.9, 1.0}.

• In the fifth level we have again two equally good splitting possibilities: {0.1, 0.2} to {0.1} and {0.2}, or {0.9, 1.0} to {0.9} and {1.0}.

• The last level consists of the singletons {0.1}, {0.2}, {0.5}, {0.6}, {0.9}, and {1.0}.

Agglomerative hierarchical clustering starts from the singletons and merges the clusters by minimizing the error introduced by the merges. Thus, the agglomerative clustering of P would be the following: The first level consists of the singletons {0.1}, {0.2}, {0.5}, {0.6}, {0.9}, and {1.0}. In the next three levels {0.1} and {0.2}, {0.5} and {0.6}, and {0.9} and {1.0} are merged in some order. Thus, level four consists of the clusters {0.1, 0.2}, {0.5, 0.6}, and {0.9, 1.0}. In level five either {0.1, 0.2} is merged with {0.5, 0.6}, or {0.5, 0.6} is merged with {0.9, 1.0}. The last level consists of the set P.

It depends on the actual use of the discretized values which one of these two approaches to hierarchical clustering is better. □
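The agglomerative variant of the example can be reproduced with a few lines of Python. The sketch assumes that clusters are contiguous runs of the sorted points and that the error of a cluster under the maximum absolute error is half of its range. The function name `agglomerative_levels` is hypothetical, and ties are broken by taking the leftmost candidate pair.

```python
from fractions import Fraction

def agglomerative_levels(sorted_points):
    """Return the list of clusterings, from singletons to one cluster."""
    clusters = [[x] for x in sorted_points]
    levels = [[tuple(c) for c in clusters]]
    while len(clusters) > 1:
        # merge the adjacent pair whose union has the smallest range
        i = min(range(len(clusters) - 1),
                key=lambda t: clusters[t + 1][-1] - clusters[t][0])
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        levels.append([tuple(c) for c in clusters])
    return levels

P = [Fraction(n, 10) for n in (1, 2, 5, 6, 9, 10)]
levels = agglomerative_levels(P)
```

Here `levels[3]` is the fourth level of the example, that is, the three pairs.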

In addition to standard divisive and agglomerative hierarchical discretizations, it is possible to find hierarchical discretizations that are optimal with respect to a given permutation π : {1, ..., |P|} → {1, ..., |P|} in the following sense: The discretization with π(1) discretization points has the minimum error among all discretizations with π(1) discretization points. The discretization with π(2) discretization points is the one that has the minimum error among all discretizations compatible with the discretization with π(1) discretization points. In general, the discretization with π(i) discretization points has the minimum error among the discretizations with π(i) discretization points that are compatible with the chosen discretizations with π(1), π(2), ..., π(i − 1) discretization points.

The time complexity of the straightforward dynamic programming implementation of this idea by modifying Algorithm 3.6 is
0(|P|^). Furthermore, for certain loss functions it is possible to 
construct hierarchical discretizations that are close to optimal for all 
values of the number of discretization points simultaneously [Das02].

The discretizations could be applied to association rules instead of frequent patterns. In that case, there are two values to be discretized for each association rule: the frequency and the accuracy of the rule. This can be generalized to patterns with d-dimensional vectors of quality values. The problem is equivalent to clustering, and thus in general the problem is NP-hard, but many known approximation algorithms for clustering can be applied [BHPI02, mVKKmm, IfHSSI, KMN+04, IK?^, KVV04].

3.3 Condensation by Discretization 

Discretization of frequencies can be used to simplify the collections 
of frequent patterns. The high-level schema is the following: 

1. Discretize the frequencies of the frequent patterns. 

2. Find a condensed representation for the pattern collection 
with the discretized frequencies. 

For example, the collection of closed frequent itemsets can be 
approximated by the closed frequent itemsets with respect to dis- 
cretized frequencies. 
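The second step of the schema can be illustrated for closed itemsets with the following Python sketch. It assumes the pattern collection is given as a dictionary from itemsets to supports, and it keeps an itemset if no proper superset in the collection has the same discretized value. The function name `closed_itemsets` and the particular discretization `gamma` are hypothetical. The supports are the ones listed in Table 3.1.

```python
def closed_itemsets(values, gamma=lambda v: v):
    """Itemsets with no proper superset of equal (discretized) value.

    values: dict mapping frozenset itemsets to supports or frequencies.
    gamma:  discretization applied to the values (identity by default).
    """
    disc = {x: gamma(v) for x, v in values.items()}
    return {x for x in disc
            if not any(x < y and disc[y] == disc[x] for y in disc)}

supports = {
    frozenset({3, 5, 7, 13, 14, 15, 20}): 488,
    frozenset({3, 5, 7, 13, 15, 20}): 489,
    frozenset({3, 5, 13, 14, 15, 20}): 489,
    frozenset({3, 5, 13, 15, 20}): 490,
    frozenset({3, 7, 13, 14, 15, 20}): 490,
    frozenset({3, 13, 14, 15, 20}): 491,
    frozenset({3, 7, 13, 15, 20}): 492,
    frozenset({5, 7, 13, 14, 15, 20}): 492,
}

# Hypothetical discretization: supports within 2 of 490 are mapped to 490.
gamma = lambda s: 490 if 488 <= s <= 492 else s

exact = closed_itemsets(supports)
condensed = closed_itemsets(supports, gamma)
```

The pairwise comparison is quadratic in the size of the collection, which is acceptable for an illustration but not for large collections.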

Example 3.11 (condensation by discretization and closed itemsets). Let ℐ = {1, ..., ⌊(1 − σ)n⌋}, σ ∈ (0, 1) and

$$\mathcal{D} = \{ (1, \{1\}), \ldots, (\lfloor (1-\sigma)n \rfloor, \{\lfloor (1-\sigma)n \rfloor\}) \} \cup \{ (\lfloor (1-\sigma)n \rfloor + i, \mathcal{I}) : i \in \{1, \ldots, \lceil \sigma n \rceil\} \}.$$

Then

$$\mathcal{FC}(\sigma, \mathcal{D}) = \{ \{1\}, \ldots, \{\lfloor (1-\sigma)n \rfloor\}, \mathcal{I} \}$$

with fr(ℐ, 𝒟) = ⌈σ|𝒟|⌉ / |𝒟| and fr({A}, 𝒟) = (⌈σ|𝒟|⌉ + 1) / |𝒟| for each A ∈ ℐ.

If we allow error 1/|𝒟| in the frequencies, then we can discretize all frequencies of the non-empty σ-frequent closed itemsets in 𝒟 to ⌈σ|𝒟|⌉ / |𝒟|, i.e., (γ ∘ fr)(∅, 𝒟) = 1 and (γ ∘ fr)(X, 𝒟) = ⌈σ|𝒟|⌉ / |𝒟| for all other X ⊆ ℐ.

Then the collection ℱ𝒞(σ, 𝒟, γ) of σ-frequent closed itemsets with respect to the discretization γ consists only of two itemsets, ∅ and ℐ, with frequencies 1 and ⌈σ|𝒟|⌉ / |𝒟|. □

Note that if the original transaction database is available, then a somewhat similar approach to condense the collection of frequent itemsets is to take a random sample of the transactions and compute the closed frequent itemsets in the sample. This reduces the number of closed itemsets but still results in a relatively good approximation for the frequencies of the frequent itemsets [Mie04c, Toi96]. A major advantage of computing the closed frequent itemsets in a sample of transactions is that it is potentially much faster than computing the collection of (closed) itemsets in the original data and discretizing the frequencies. Disadvantages of this sampling approach are that the outcome of the closed itemset mining from the sample is also a random variable depending on the sample, and that the quality of the approximation provided by the closed frequent itemsets in the sample is at most as good as the quality of the optimal approximation provided by the optimal discretization of the frequencies. Of course, sampling and discretization could be used in conjunction, by first taking a relatively large sample of transactions for obtaining the closed frequent itemsets efficiently and then discretizing the frequencies of the closed frequent itemsets in the sample. This should provide computational efficiency and approximation quality in between those of sampling and discretizing. We focus, however, solely on discretizations.

Example 3.12 (closed itemsets disappearing in the course completion database). Let us consider the collection ℱ𝒞(σ, 𝒟) of closed 0.20-frequent itemsets in the course completion database (see Subsection 2.2.1). Recall (Example 2.9) that the number |ℱ𝒞(σ, 𝒟)| of the closed 0.20-frequent itemsets in the course completion database is 2136.

If the supports are discretized with the maximum absolute error 2 (that is, less than 0.1 percent of the number of transactions in the database), then the number of closed itemsets with respect to the discretized supports is only 567, i.e., less than 24 percent of |ℱ𝒞(σ, 𝒟)|.

In some parts of the itemset collection ℱ𝒞(σ, 𝒟) the reduction in the number of the closed itemsets can be even greater than the average. For example, there are eight subsets of the itemset X = {3, 5, 7, 13, 14, 15, 20} that are closed with respect to exact supports but that have the same discretized support as X. These itemsets are shown in Table 3.1.

□

Table 3.1: The itemsets with the same discretized support as {3, 5, 7, 13, 14, 15, 20} in the course completion database. The columns are as follows: supp(X, 𝒟) is the exact support of the itemset X in the course completion database 𝒟, γ_a^2(supp(X, 𝒟)) is supp(X, 𝒟) discretized with maximum absolute error 2, and X ∈ ℱ𝒞(σ, 𝒟) is the itemset X.

supp(X, 𝒟)    γ_a^2(supp(X, 𝒟))    X ∈ ℱ𝒞(σ, 𝒟)
488           490                   {3, 5, 7, 13, 14, 15, 20}
489           490                   {3, 5, 7, 13, 15, 20}
489           490                   {3, 5, 13, 14, 15, 20}
490           490                   {3, 5, 13, 15, 20}
490           490                   {3, 7, 13, 14, 15, 20}
491           490                   {3, 13, 14, 15, 20}
492           490                   {3, 7, 13, 15, 20}
492           490                   {5, 7, 13, 14, 15, 20}



The number of discretization points determines the quality of the approximation. At one extreme, a discretization using only one discretization point, the frequent itemsets that are closed with respect to the discretized frequencies correspond to the maximal frequent itemsets. When the number of discretization points increases, the number of closed frequent itemsets also increases, the other extreme being the collection of frequent closed itemsets without any discretization.

If the condensed representation depends on testing whether the 
frequencies of the patterns are equal (such condensed representa- 
tions are, for example, the closed and the free patterns), then the 
number of discretization points can be used as an estimate of the 
effectiveness of the discretization. In addition to simplifying the 
collections of frequent patterns, discretization can be used to make 
the discovery of some patterns more efficient. 

We evaluated the condensation abilities of discretizations by discretizing the frequencies of the frequent itemsets in the Internet Usage and IPUMS Census databases (see Subsection 2.2.1), and then computing which of the frequent itemsets are closed also with respect to the discretized frequencies. (In these experiments, we omitted the empty itemset from the itemset collections since its frequency is always 1.)

In the first series of experiments we were interested in whether data-dependent discretizations yield smaller collections of closed itemsets than their data-independent counterparts. We discretized the frequencies using the discretization function γ_a^ε (Equation 3.4) and the algorithm Prefix-Cover (Algorithm 3.2) with different maximum absolute error thresholds ε and removed the itemsets that were not closed with respect to the discretized frequencies.

The results for the Internet Usage database with the minimum frequency threshold 0.05 are shown in Table 3.2. The number of the 0.05-frequent itemsets, the number of the closed 0.05-frequent itemsets and the number of the maximal 0.05-frequent itemsets in the Internet Usage database are 143391, 141568, and 23441, respectively.

Table 3.2: The number of closed itemsets in the collection of 0.05-frequent itemsets in the Internet Usage database with discretized frequencies for different maximum absolute error guarantees. The columns of the table are the maximum absolute error ε allowed, the number of σ-frequent itemsets that are closed with respect to the frequencies discretized using Equation 3.4, and the number of σ-frequent itemsets that are closed with respect to the frequencies discretized using Algorithm 3.2.

ε         fixed discretization    empirical discretization
0.0010    123426                  123104
0.0050    72211                   71765
0.0100    54489                   45944
0.0200    34536                   31836
0.0400    31587                   25845
0.0600    26087                   24399
0.0800    24479                   23916
0.1000    23960                   23705

The results for the IPUMS Census database with the minimum frequency threshold 0.2 are shown in Table 3.3. The results were similar for other minimum frequency thresholds. The number of the 0.2-frequent itemsets, the number of the closed 0.2-frequent itemsets and the number of the maximal 0.2-frequent itemsets in the IPUMS Census database are 86879, 6689, and 578, respectively.

Table 3.3: The number of closed itemsets in the collection of 0.2-frequent itemsets in the IPUMS Census database with discretized frequencies for different maximum absolute error guarantees. The columns of the table have the same interpretation as the columns of Table 3.2.

ε         fixed discretization    empirical discretization
0.0010    3226                    3242
0.0050    2362                    2375
0.0100    1776                    1772
0.0200    1223                    1225
0.0400    1014                    841
0.0600    932                     725
0.0800    711                     661
0.1000    627                     627

Clearly, the number of closed σ-frequent itemsets is an upper bound and the number of maximal σ-frequent itemsets is a lower bound for the number of frequent itemsets that are closed with respect to the discretized frequencies. The maximum absolute error is minimized in the case of just one discretization point by choosing its value to be the average of the maximum and the minimum frequencies. The maximum absolute error for the best discretization with only one discretization point for the 0.05-frequent itemsets in the Internet Usage database is 0.4261. This is due to the fact that the highest frequency in the collection of the 0.05-frequent itemsets in the Internet Usage database (excluding the empty itemset) is 0.9022. The maximum absolute error for the best discretization with one discretization point for the 0.2-frequent itemsets in the IPUMS Census database is 0.4000. That is, there is an itemset with frequency equal to 0.2 and an itemset with frequency equal to 1 in the collection of 0.2-frequent itemsets in the IPUMS Census database.
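As a quick check of the two error values quoted above: with a single discretization point placed at the midpoint of the frequency range, the maximum absolute error is half of the range. The computation below assumes that the smallest frequency in the Internet Usage collection coincides with the threshold 0.05 (for the IPUMS Census collection the text states that the frequency 0.2 is attained).

$$
\frac{f_{\max} - f_{\min}}{2}: \qquad \frac{0.9022 - 0.05}{2} = 0.4261, \qquad \frac{1 - 0.2}{2} = 0.4000 .
$$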

In addition to minimizing the maximum absolute error, we computed the optimal discretizations with respect to the average absolute error using dynamic programming (Algorithms 3.5, 3.6 and 3.7). In particular, we computed the optimal discretizations for each possible number of discretization points. The practical feasibility of the dynamic programming discretization depends crucially on the number N = |fr(ℱ(σ, 𝒟), 𝒟)| of different frequencies, as its time complexity is O(N^3). Thus, the tests were conducted using smaller collections of frequent itemsets than in the case of discretization with respect to the maximum absolute error.

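For the average absolute error, a uniform weighting over the frequent itemsets was used. That is, the error of the discretization γ was

$$\frac{1}{|\mathcal{F}(\sigma, \mathcal{D})|} \sum_{X \in \mathcal{F}(\sigma, \mathcal{D})} \left| fr(X, \mathcal{D}) - \gamma(fr(X, \mathcal{D})) \right| .$$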

The results are shown in Figure 3.1 and in Figure 3.2. The figures can be interpreted as follows. The labels of the curves are the minimum frequency thresholds σ for the collections of σ-frequent itemsets they correspond to. The upper figures show the number of frequent itemsets that are closed with respect to the discretized frequencies against the average absolute error of the discretization. The lower figures show the number of the closed σ-frequent itemsets for discretized frequencies against the number of discretization points.

On the whole, the results are encouraging, especially as the discretizations do not directly exploit the structure of the pattern collection but only the frequencies. Although there are differences between the results on different databases, it is possible to observe that even with a quite small number of closed frequent itemsets and discretization points, the frequencies of the frequent itemsets were approximated adequately.

Figure 3.1: The best average absolute error discretizations for Internet Usage data.

Figure 3.2: The best average absolute error discretizations for IPUMS Census data.



CHAPTER 4 



Trade-offs between Size and 
Accuracy 

There are trade-offs between the understandability of the pattern collection and its ability to describe the data at hand:

• If the pattern collection is small, then there is a chance that 
it could eventually be understandable. 

• If the pattern collection is large, then it might describe the 
data underlying the pattern collection adequately. 

Sometimes a very small collection of patterns can be both an
understandable and an accurate description of the data. In general,
however, characterizing the data accurately requires many patterns,
assuming that there are many different relevant data sets.

The trade-offs between understandability and accuracy have been
studied in pattern discovery mainly by comparing the cardinality
of the pattern collection to a quantitative measure of how well the
pattern collection describes the relevant aspects of the data.

Typically, one obtains smaller pattern collections by using sufficiently
high minimum quality value thresholds. Finding a minimum
quality value threshold that captures most of the interesting and
only a few uninteresting patterns is a challenging or even impossible
task in practice.

Example 4.1 (discovering the backbone of a supermarket's
profit). Let $\mathcal{D}$ be a transaction database of a supermarket containing
purchase events, the set $\mathcal{I}$ of items consisting of the products
sold in that supermarket. Furthermore, let $w(A)$ be the profit of
the item $A \in \mathcal{I}$ and let $w(X)$ be the combined profit of the items
in the itemset $X$, i.e., $w(X) = \sum_{A \in X} w(A)$.

Suppose that we are interested in itemsets that fetch large portions
of the profit of the supermarket, i.e., the itemsets $X \subseteq \mathcal{I}$ with
large (weighted) area $w(X)\,supp(X,\mathcal{D})$ in the transaction database
$\mathcal{D}$. Then we have to face several problems, for example the following
ones.

First, there is no way to define a minimum frequency thresh- 
old that would capture most of the itemsets fetching a large profit 
without discovering many itemsets with less relevancy to the total 
profit of the supermarket, although the support of the itemset is 
the only data-dependent part of this interestingness measure. 

Second, it is not clear what would be the right minimum area 
threshold. For example, why should we choose 10000 instead of 
9999 or vice versa? Intuitively this should not matter. However, 
even a small change in the threshold might change the collection of 
interesting patterns considerably. 

Third, we could find out that we are actually more interested in
some other kinds of products, e.g., products with character and a
weak brand. However, even realizing that these constraints are important
for an itemset being interesting might be very difficult from a
large collection of itemsets that contain also sufficiently many such
itemsets. Furthermore, a weak brand can perhaps be detected based
on the discrepancy between the market value and the production
costs of the product, but determining that a product has character
is a highly subjective task without, e.g., extensive customer polls.

Thus, finding truly interesting patterns from data is often a chal- 
lenging, iterative and interactive process. □ 
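
The first two difficulties of Example 4.1 can be made concrete with a small Python sketch. The item names, profits and transactions below are hypothetical and chosen only for illustration; `support` counts the transactions containing an itemset and `weighted_area` computes $w(X)\,supp(X,\mathcal{D})$.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def weighted_area(itemset, profit, transactions):
    """w(X) * supp(X, D), where w(X) is the sum of the item profits."""
    return sum(profit[a] for a in itemset) * support(itemset, transactions)

# Hypothetical toy data (not taken from the text).
profit = {"beer": 2.0, "chips": 1.0, "caviar": 30.0}
transactions = [
    {"beer", "chips"},
    {"beer", "chips"},
    {"beer"},
    {"caviar"},
]

def itemsets_above(threshold):
    """All non-empty itemsets whose weighted area reaches `threshold`."""
    items = sorted(profit)
    found = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            x = frozenset(combo)
            if weighted_area(x, profit, transactions) >= threshold:
                found.append(x)
    return found
```

On this toy data the itemset with the largest area, {caviar}, has the smallest support among the itemsets that occur at all, so no minimum frequency threshold isolates it. Moving the area threshold from 6 to 6.5 changes the result from three itemsets to one.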

To reduce the discrepancy between the size and the accuracy, 
several condensed representations of pattern collections have been 
introduced. (See Section lTH for more details.) They share, however, 
the same fundamental difficulties as the other pattern collections: 
it is difficult to find a small pattern collection that summarizes 
(the relevant aspects of) the data well. Overcoming these problems 
with the size and the accuracy seems to be very difficult and they 
give rise also to a crisp need for interactive exploration of pattern 
collections and the trade-offs between the size and the accuracy. 
If the whole pattern collection is too huge to comprehend, then
a natural solution to this problem is to consider only a subcollec- 
tion of patterns. There are a few properties that a good subcollec- 
tion of a pattern collection should fulfill. First, the subcollection 
should be representative for the whole pattern collection. (This re- 
quirement is based on the assumption that if the pattern collection 
describes the data well, then also the representative subcollection 
should describe the data quite well. The reason why the require- 
ment is not defined directly for the data, instead of the patterns, is 
that the data might not be available or accessing it might be very 
expensive. Furthermore, the methods described in this chapter can 
readily be adapted for measuring the quality of the subcollection 
using the data instead of the patterns.) Second, the representative
subcollection of $k$ patterns should not differ very much from
the representative subcollections of $k + 1$ and $k - 1$ patterns, i.e.,
the representative subcollections should support interactive mining
as it is presumably highly non-trivial to guess the right value of $k$
immediately.

In this chapter we propose, as a solution to this problem, to order
the patterns in such a way that the $k$th pattern in the ordering
improves our estimate of the quality values of the patterns as much
as possible, given also the $k - 1$ previous patterns in the ordering.
Note that this ensures that the representative subcollection of $k$
patterns does not differ much from the representative subcollections
of $k + 1$ and $k - 1$ patterns.

In addition to the pattern ordering problem, we study also the
problem of choosing the best $k$-subcollection of the patterns. We
show that this problem is NP-hard in general. However, for certain
loss functions and estimation methods, the optimal pattern
ordering provides a constant factor approximation for the best
$k$-subcollections for all values of $k$ simultaneously. That is, each
length-$k$ prefix of the ordering is a subcollection that describes the
quality values of the patterns almost as well as the best subcollection
of cardinality $k$.

The feasibility of the method depends strongly on the loss func- 
tion and the estimation method at hand. To exemplify this, we 
describe concrete instantiations of pattern orderings in two cases. 
First, we use the pattern ordering to provide a refining representation
of frequent patterns. The representation is based on estimating
the frequencies of the patterns by the maximum of the frequencies
of the known superpatterns. Any prefix of this pattern ordering
can be seen as an approximation of closed frequent patterns. Second,
we show how transaction databases can be described as tilings.
(A tiling is a collection of tiles. A tile consists of an itemset and a set
of transaction identifiers of transactions that contain the itemset.
We use the fraction of the items of the database covered by the tiles
as the quality of the tiling.)

Finally, we empirically evaluate the suitability of pattern order- 
ings to serve as condensed representations in the case of the fre- 
quent itemsets. More specifically, we estimate the frequencies of 
the frequent itemsets using the maximum frequencies of the known 
superitemsets and measure the loss by the average of the absolute 
differences between the correct and the estimated frequencies. 

This chapter is based mostly on the article "The Pattern Order- 
ing Problem" |MM03j . (The example of tilings described in Sec- 
tion|331is related also to the article "Tiling Databases" ISnMOlj.) 

4.1 The Pattern Ordering Problem 

Most condensed representations of pattern collections consist of a 
subcollection of the patterns such that the subcollection represents 
the whole pattern collection well, often exactly. Representing the 
whole pattern collection well by its subcollection depends on two 
components. 

First, it depends on a function for estimating the quality val- 
ues of the patterns from the quality values of the patterns in its 
subcollection, i.e., an estimation method 

$$\psi : \bigcup_{\mathcal{S} \subseteq \mathcal{P}} \mathcal{P} \times [0,1]^{\mathcal{S}} \to [0,1].$$

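Example 4.2 (frequency estimation). A simple estimate for
the frequency $fr(X,\mathcal{D})$ of an itemset $X \subseteq \mathcal{I}$ is

$$\psi_{\mathcal{I},\delta}(X, fr|_{\mathcal{S}}) = \delta^{|\{A \in X : \{A\} \notin \mathcal{S}\}|} \prod_{A \in X,\ \{A\} \in \mathcal{S}} fr(\{A\},\mathcal{D}) \qquad (4.1)$$

where $\mathcal{S} \subseteq 2^{\mathcal{I}}$ and $\delta$ is the default frequency for the items whose
frequencies are not known. This estimation method assumes the
independence of the items.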
The downside of this estimation method is that it does not make
use of frequencies other than those of the singleton
itemsets. Fortunately, it can be generalized to exploit also other
frequencies. The idea of the generalization is to find a probability
distribution over the itemsets in the transactions that has the
maximum entropy among the probability distributions compatible
with the frequency constraints. This estimation method has been
applied successfully in estimating the frequencies of itemsets based
on the frequencies of some other itemsets [PMS03]. □
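
The independence-based estimate of Example 4.2 can be sketched in a few lines of Python. The function name and the representation (a dictionary from singleton `frozenset`s to frequencies) are assumptions made for this illustration, not notation from the text.

```python
def estimate_independent(itemset, known, default):
    """Product of the known singleton frequencies of the items in `itemset`.

    `known` maps singleton frozensets to their frequencies; items whose
    singleton frequency is not known contribute the `default` frequency.
    """
    estimate = 1.0
    for item in itemset:
        estimate *= known.get(frozenset([item]), default)
    return estimate

# Hypothetical known frequencies: fr({A}) = 0.5 and fr({B}) = 0.4.
known = {frozenset(["A"]): 0.5, frozenset(["B"]): 0.4}
```

With these values the estimate for {A, B} is 0.5 · 0.4, and the estimate for {A, C} with default frequency 0.1 is 0.5 · 0.1, since the frequency of {C} is not known.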

Example 4.3 (frequency estimation). Another simple frequency
estimate is

$$\psi_{Max}(X, fr|_{\mathcal{S}}) = \max\{ fr(Y,\mathcal{D}) : Y \in \mathcal{S},\ Y \supseteq X \}. \qquad (4.2)$$

Note that in the case of the closed itemsets fPefinition I2.iup . the 
frequencies of the non-closed itemsets are obtained using this rule. 
□ 

Second, the estimation is evaluated by a function that measures
the error of the estimation, i.e., a loss function

$$\ell : [0,1]^{\mathcal{P}} \times [0,1]^{\mathcal{P}} \to \mathbb{R}.$$

Example 4.4 ($L_p$ norms). One popular class of loss functions is
the class of $L_p$ norms

$$\ell_{L_p}(\phi, \psi(\cdot, \phi|_{\mathcal{S}})) = \Big( \sum_{x \in \mathcal{P}} |\phi(x) - \psi(x, \phi|_{\mathcal{S}})|^p \Big)^{1/p}. \qquad (4.3)$$

For example, if $p = 2$ then the $L_p$ norm is the Euclidean distance,
and if $p = 1$, then it is the sum of absolute errors. The case where
$p = \infty$ corresponds to the maximum of the absolute errors. □
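
As a minimal sketch of Equation 4.3, the following Python function evaluates the $L_p$ loss between true and estimated quality values. The dictionary representation and the name `lp_loss` are assumptions of this illustration.

```python
import math

def lp_loss(phi, estimate, p):
    """L_p norm of the estimation errors; p may be math.inf.

    `phi` and `estimate` map each pattern to its true and estimated
    quality value, respectively.
    """
    errors = [abs(phi[x] - estimate[x]) for x in phi]
    if math.isinf(p):
        return max(errors, default=0.0)
    return sum(e ** p for e in errors) ** (1.0 / p)

# Hypothetical quality values and an all-zero estimate.
phi = {"x": 3.0, "y": 4.0}
zero = {"x": 0.0, "y": 0.0}
```

For this toy input the three cases named in Example 4.4 give the sum of absolute errors (p = 1), the Euclidean distance (p = 2) and the maximum absolute error (p = ∞).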

To simplify the considerations, we consider estimation methods 
and loss functions as oracles, i.e., as functions that can be evaluated 
in constant time regardless of their true computational complexity 
or even computability. (Although the loss functions are often com- 
putable in a reasonable time, there is not a necessity for that restric- 
tion in the context of this chapter.) With the aid of the estimation 
method and the loss function, we can formulate the problem of find- 
ing a subcollection that represents the whole pattern collection well 
as follows: 
Problem 4.1 (the best $k$-subcollection of patterns). Given a
pattern collection $\mathcal{P}$, an interestingness measure $\phi$, a positive
integer $k$, an estimation function $\psi$, and a loss function $\ell$, find the
best $k$-subcollection $\mathcal{S}$ of $\mathcal{P}$, i.e., find a collection $\mathcal{S} \subseteq \mathcal{P}$ such that
$|\mathcal{S}| = k$ and for all $k$-subcollections $\mathcal{S}'$ of $\mathcal{P}$ holds

$$\ell(\phi, \psi(\cdot, \phi|_{\mathcal{S}})) \leq \ell(\phi, \psi(\cdot, \phi|_{\mathcal{S}'})).$$

That is, $\mathcal{S}$ has the smallest error among all $k$-subcollections of $\mathcal{P}$.

The problem of finding the best $k$-subcollection of patterns depends
on five parameters: the pattern collection $\mathcal{P}$, the interestingness
measure $\phi$, the estimation method $\psi$, the loss function $\ell$ and
the number $k$ of patterns allowed in the subcollection.

Example 4.5 (the best $k$-subcollection of itemsets). Combining
Example 4.3 and Example 4.4 we get one instance of Problem 4.1:

• The pattern collection $\mathcal{P}$ is a subcollection of the collection
$2^{\mathcal{I}}$ of all subsets of the set $\mathcal{I}$ of items. For example, $\mathcal{P}$ could
be the collection $\mathcal{F}(\sigma,\mathcal{D})$ of the $\sigma$-frequent itemsets in a given
transaction database $\mathcal{D}$.

• The interestingness measure $\phi$ is the frequency in the transaction
database $\mathcal{D}$ and it is defined for all itemsets in the
pattern collection.

• The frequencies of the itemsets are estimated by taking the
maximum of the frequencies of known superitemsets of the
itemset whose frequency is under estimation, i.e., the estimation
method $\psi$ is as defined by Equation 4.2.

• The loss in the estimation is measured by the maximum of
the absolute differences between the estimated and the correct
frequencies of the itemsets in $\mathcal{P}$. That corresponds to
Equation 4.3 with $p = \infty$.

These parameters together with the number $k$ (the maximum number
of patterns) form an instance of Problem 4.1. □
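
For very small collections, the instance of Example 4.5 can be solved by exhaustive search. The sketch below is only an illustration under stated assumptions: itemsets are `frozenset`s, the frequencies are hypothetical, and an itemset with no known superitemset is estimated to have frequency 0 (a convention the text does not spell out).

```python
from itertools import combinations

def estimate_max(x, known):
    """Maximum frequency among the known superitemsets of x (0 if none)."""
    return max((f for y, f in known.items() if x <= y), default=0.0)

def max_abs_error(freq, subcollection):
    """Maximum absolute error when only `subcollection` is known."""
    known = {y: freq[y] for y in subcollection}
    return max(freq[x] - estimate_max(x, known) for x in freq)

def best_k_subcollection(freq, k):
    """Exhaustive search over all k-subcollections (exponential time)."""
    return min(combinations(freq, k), key=lambda s: max_abs_error(freq, s))

# Hypothetical frequencies of three itemsets.
freq = {
    frozenset("A"): 0.6,
    frozenset("B"): 0.5,
    frozenset("AB"): 0.4,
}
```

Here the best 1-subcollection is {A, B} alone: it underestimates fr({A}) by 0.2 and fr({B}) by 0.1, whereas either singleton leaves an error of at least 0.5. The exhaustive search is feasible only for toy inputs, which is in line with the hardness result that follows.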

Problem 4.1 is an optimization problem. It can easily be transformed
to a decision problem that asks whether there exists a
$k$-subcollection of $\mathcal{P}$ with the error at most $\epsilon$ instead of looking for
the $k$-subcollection with the smallest error. Unfortunately, even a
simple special case of the problem (the decision version of Example 4.5)
is NP-complete as shown by Theorem 4.1.

We show the NP-hardness of Problem 4.1 by reduction from
Problem 4.2, which is known to be NP-complete [GJ79]. (For more
details in complexity theory and NP-completeness, see [GJ79, Pap95].)

Problem 4.2 (minimum cover [GJ79]). Given a collection $\mathcal{C}$
of subsets of a finite set $S$ and a positive integer $k$, decide whether
or not $\mathcal{C}$ contains a cover of $S$ of size $k$, i.e., whether or not there is
a subset $\mathcal{C}' \subseteq \mathcal{C}$ with $|\mathcal{C}'| = k$ such that every element of $S$ belongs
to at least one member of $\mathcal{C}'$.

Note that we omit the empty itemset from the collection $\mathcal{F}(\sigma,\mathcal{D})$
in the proof of Theorem 4.1 to simplify the reduction, since $fr(\emptyset,\mathcal{D}) = 1$,
i.e., it is never necessary to estimate $fr(\emptyset,\mathcal{D})$.

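Theorem 4.1. Given a collection $\mathcal{F}(\sigma,\mathcal{D})$ of $\sigma$-frequent itemsets
in a transaction database $\mathcal{D}$, a maximum error bound $\epsilon$ and a cardinality
bound $k$, it is NP-complete to decide whether or not there is
a subcollection $\mathcal{F}(\sigma,\mathcal{D})'$ of $\mathcal{F}(\sigma,\mathcal{D})$ with the cardinality $k$ such that the maximum
absolute error between the correct frequency and the maximum
of the frequencies of the superitemsets in the subcollection is at most
$\epsilon$. That is, it is NP-complete to decide whether or not there is a
collection $\mathcal{F}(\sigma,\mathcal{D})' \subseteq \mathcal{F}(\sigma,\mathcal{D})$ such that $|\mathcal{F}(\sigma,\mathcal{D})'| = k$ and

$$\max_{X \in \mathcal{F}(\sigma,\mathcal{D})} \left\{ fr(X,\mathcal{D}) - \max\{ fr(Y,\mathcal{D}) : X \subseteq Y \in \mathcal{F}(\sigma,\mathcal{D})' \} \right\} \leq \epsilon.$$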

Proof. The problem is in NP since we can check in time polynomial
in the sum of the cardinalities of the itemsets in $\mathcal{F}(\sigma,\mathcal{D})$ whether or
not the maximum absolute error is at most $\epsilon$ for all $X \in \mathcal{F}(\sigma,\mathcal{D})$.
That is, we can check in polynomial time for each $X \in \mathcal{F}(\sigma,\mathcal{D})$ and
for each $Y \in \mathcal{F}(\sigma,\mathcal{D})'$ such that $X \subseteq Y$ whether

$$|fr(X,\mathcal{D}) - fr(Y,\mathcal{D})| = fr(X,\mathcal{D}) - fr(Y,\mathcal{D}) \leq \epsilon.$$

Let us now describe the reduction from Problem 4.2. It is easy to
see that we can assume for each instance $(S, \mathcal{C}, k)$ of Problem 4.2
that each element of $S$ is contained in at least one set in $\mathcal{C}$, no
set in $\mathcal{C}$ is contained in another set in $\mathcal{C}$, and the cardinality of
each set in $\mathcal{C}$ is greater than one. Furthermore, we assume that
the cardinalities of the sets in $\mathcal{C}$ are all at most three; the problem
remains NP-complete [GJ79].

An instance $(\mathcal{C}, S, k)$ of the minimum cover problem is reduced
to an instance $(\mathcal{F}(\sigma,\mathcal{D}), fr, \psi, \ell, k, \epsilon)$ as follows. The set $\mathcal{I}$ of items
is equal to the set $S$. The pattern collection $\mathcal{F}(\sigma,\mathcal{D})$ consists of
the sets in $\mathcal{C}$ and all their non-empty subsets. Thus, the cardinality
of $\mathcal{F}(\sigma,\mathcal{D})$ is $O(|S|^3)$, since we assumed that the cardinality
of the largest set in $\mathcal{C}$ is three. The transaction database $\mathcal{D}$ consists
of one transaction for each set in $\mathcal{C}$ and an appropriate number
of transactions that are singleton subsets of $S$ to ensure that
$fr(\{A\},\mathcal{D}) = fr(\{B\},\mathcal{D}) > fr(X,\mathcal{D})$ for all $A, B \in S$ and $X \in \mathcal{C}$.
(Thus, the minimum frequency threshold a is 

If we set e = fr{{A} ,2?) — for any element A in S, then 

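there is a set cover $\mathcal{C}' \subseteq \mathcal{C}$ of $S$ such that $|\mathcal{C}'| = k$ if and only if

$$\max_{X \in \mathcal{F}(\sigma,\mathcal{D})} \left\{ fr(X,\mathcal{D}) - \max\{ fr(Y,\mathcal{D}) : X \subseteq Y \in \mathcal{F}(\sigma,\mathcal{D})' \} \right\} \leq \epsilon$$

holds for the same collection $\mathcal{C}' = \mathcal{F}(\sigma,\mathcal{D})' \subseteq \mathcal{F}(\sigma,\mathcal{D})$ with respect
to the transaction database $\mathcal{D}$. (Note that without loss of generality,
we can assume that $\mathcal{F}(\sigma,\mathcal{D})' \subseteq \mathcal{FM}(\sigma,\mathcal{D}) = \mathcal{C}$.) Thus, the
problem is NP-hard, too. □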

Thus, the decision version of the special case of Problem 4.1
described by Example 4.5 is NP-complete by Theorem 4.1. Hence,
Problem 4.1 itself is NP-hard. (For an alternative example of such a special
case of Problem 4.1 shown to be NP-complete, see [AGM04].)

Furthermore, the proof of Theorem 4.1 implies also the following
inapproximability result for the optimization version of the problem.
(For more details in approximability, see [ACK+99].)

Theorem 4.2. Given a collection $\mathcal{P}$ of itemsets in a transaction
database $\mathcal{D}$, their frequencies, the estimation method $\psi_{Max}$ defined
by Equation 4.2 and the loss function $\ell_{L_\infty}(\phi, \psi(\cdot, \phi|_{\mathcal{S}}))$ defined
by Equation 4.3, it is NP-hard to find a subcollection $\mathcal{S}$ of $\mathcal{P}$ such
that

$$\ell_{L_\infty}(fr|_{\mathcal{P}}, \psi_{Max}(\cdot, fr|_{\mathcal{S}})) \leq \epsilon$$

and the cardinality of $\mathcal{S}$ is within a factor $c \log |\mathcal{I}|$ (for some
constant $c > 0$) from the cardinality of the smallest subcollection of
$\mathcal{P}$ with error at most $\epsilon$.
Proof. The reduction in the proof of Theorem 4.1 shows that the
problem is APX-hard [ACK+99, PY91].

If the collection $\mathcal{F}(\sigma,\mathcal{D})$ is replaced by the collection $\mathcal{P} = \mathcal{C} \cup
\{\{A\} : A \in S\}$, then we can get rid of the cardinality constraints for
the sets in $\mathcal{C}$ while still maintaining the itemset collection $\mathcal{P}$ being
of polynomial size in the size of the input $(\mathcal{C}, S, k)$. This gives us
stronger inapproximability results. Namely, it is NP-hard to find
a set cover $\mathcal{C}' \subseteq \mathcal{C}$ of the cardinality within a logarithmic factor
$c \log |S|$ (for some constant $c > 0$) from the smallest set cover of $S$
in $\mathcal{C}$ [ACK+99, RS97].

If we could find a collection $\mathcal{S} \subseteq \mathcal{P}$ of the cardinality $k$ and the
error at most $\epsilon$, then that collection could also be a set cover of $S$
of the cardinality $k$. □

Even if there was a polynomial-time solution for Problem 4.1, it
is not clear whether it is the right problem to solve after all. A major
disadvantage of the problem is that it does not take into account
the requirement that the solution consisting of $k$ patterns should
be close to the solutions consisting of $k + 1$ and $k - 1$ patterns. In
general, it would be desirable that the solutions of all cardinalities
would be somewhat similar.

One approach to ensure that is to order the patterns somehow
and consider each length-$k$ prefix of the ordering as the representative
$k$-subcollection of the patterns. The ordering should be such
that the prefixes of the ordering are good representative subcollections
of (the quality values of) the pattern collection. For example,
the patterns could be ordered based on how well the prefixes of the
ordering describe the collection.

Problem 4.3 (pattern ordering). Given a pattern collection $\mathcal{P}$,
an interestingness measure $\phi$, an estimation function $\psi$ and a loss
function $\ell$, find an ordering $p_1, \ldots, p_{|\mathcal{P}|}$ of the patterns such that

$$\ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_i})) \leq \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_{i-1} \cup \{p_j\}})) \qquad (4.4)$$

for all $i \in \{1, \ldots, |\mathcal{P}|\}$ and $j \in \{i, \ldots, |\mathcal{P}|\}$, where $\mathcal{P}_i = \{p_1, \ldots, p_i\}$.

The pattern ordering can be seen as a refining approximation of
the pattern collection: the first pattern in the ordering describes
the pattern collection at least as well as any other pattern in the
collection, the second pattern is the best choice if the first pattern
is already chosen to the representation. In general, the $k$th
pattern in the ordering is the best choice to improve the estimate
given the first $k - 1$ patterns in the ordering.

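Algorithm 4.1 The pattern ordering algorithm.

Input: The collection $\mathcal{P}$ of patterns, the interestingness measure
$\phi$, the estimation method $\psi$, and the loss function $\ell$.
Output: The optimal pattern ordering as defined by Equation 4.4
and the loss $\epsilon_i = \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_i}))$ for each $i$-prefix $\mathcal{P}_i$ of the
pattern ordering.
1: function Order-Patterns($\mathcal{P}, \phi, \psi, \ell$)
2:   $\mathcal{P}_0 \leftarrow \emptyset$
3:   for $i = 0, \ldots, |\mathcal{P}| - 1$ do
4:     $p_{i+1} \leftarrow \arg\min_{p \in \mathcal{P} \setminus \mathcal{P}_i} \{ \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_i \cup \{p\}})) \}$
5:     $\mathcal{P}_{i+1} \leftarrow \mathcal{P}_i \cup \{p_{i+1}\}$
6:     $\epsilon_{i+1} \leftarrow \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_{i+1}}))$
7:   end for
8:   return $((p_1, \ldots, p_{|\mathcal{P}|}), (\epsilon_1, \ldots, \epsilon_{|\mathcal{P}|}))$
9: end function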
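
A direct Python transcription of the greedy procedure may help to fix the idea. It is a sketch under one simplifying assumption: the interestingness measure, the estimation method and the loss function are folded into a single hypothetical oracle `loss_of`, which maps a list of chosen patterns to the loss of the estimate based on them.

```python
def order_patterns(patterns, loss_of):
    """Greedy pattern ordering.

    `loss_of(subcollection)` is an oracle returning the loss of the
    estimate based on the given list of patterns.  Returns the ordering
    and the loss of each non-empty prefix.
    """
    remaining = list(patterns)
    prefix, losses = [], []
    while remaining:
        # Pick the pattern whose addition gives the smallest loss.
        best = min(remaining, key=lambda p: loss_of(prefix + [p]))
        remaining.remove(best)
        prefix.append(best)
        losses.append(loss_of(prefix))
    return prefix, losses

# Hypothetical oracle: unknown quality values are estimated as zero and
# the loss is the sum of the errors.
phi = {"a": 5, "b": 3, "c": 1}

def simple_loss(subcollection):
    return sum(q for p, q in phi.items() if p not in subcollection)
```

Calling `order_patterns(list(phi), simple_loss)` orders the three toy patterns by decreasing quality value and reports the losses 4, 1 and 0 for the prefixes of length one, two and three. The sketch performs a quadratic number of oracle calls.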



The pattern ordering and the estimation errors for all prefixes
of the ordering can be computed efficiently by Algorithm 4.1. The
running time of the algorithm depends crucially on the complexity
of evaluating the expression $\ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_i \cup \{p\}}))$ for each pattern
$p \in \mathcal{P} \setminus \mathcal{P}_i$ and for all $i = 0, \ldots, |\mathcal{P}| - 1$. If $M(\mathcal{P})$ is the maximum
time complexity of finding the pattern $p_{i+1}$ that improves the prefix
$\mathcal{P}_i$ as much as possible with respect to the estimation method and
the loss function, then the time complexity of Algorithm 4.1 is
bounded above by $O(|\mathcal{P}| M(\mathcal{P}))$. Note that the algorithm requires
at most $O(|\mathcal{P}|^2)$ loss function evaluations since there are $O(|\mathcal{P}|)$
possible patterns to be the $i$th pattern in the ordering.

Example 4.6 (on the efficiency of Algorithm 4.1). Let the
estimation method be

$$\psi_{Simple}(p, \phi|_{\mathcal{S}}) = \begin{cases} \phi(p) & \text{if } p \in \mathcal{S} \text{ and} \\ 0 & \text{otherwise,} \end{cases}$$

i.e., let the quality values be zero unless explicitly given, and let
the loss be the sum of the differences

$$\phi(p) - \psi_{Simple}(p, \phi|_{\mathcal{S}}) = \begin{cases} 0 & \text{if } p \in \mathcal{S} \text{ and} \\ \phi(p) & \text{otherwise.} \end{cases}$$

Then the pattern that improves the solution the most
can be found in time logarithmic in $|\mathcal{P}|$ by using a heap [Knu98].
More specifically, the quality values of the patterns in $\mathcal{P}$ are put into
the heap in time $O(|\mathcal{P}| \log |\mathcal{P}|)$. The best pattern can be found in
each iteration by picking the pattern with the highest quality value in
the heap. Thus, the total running time of the algorithm is then
$O(|\mathcal{P}| \log |\mathcal{P}|)$. (Note that the optimal pattern ordering could be
obtained in this case also by sorting the patterns with respect to
their quality values.) □
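
The heap-based variant of Example 4.6 can be sketched with Python's standard `heapq` module. The function name is an assumption of this illustration; since `heapq` implements a min-heap, the quality values are negated to pop the largest one first.

```python
import heapq

def order_by_quality(phi):
    """Ordering for the estimate that is zero unless a value is known.

    Returns the patterns by decreasing quality value together with the
    sum-of-errors loss of every non-empty prefix.
    """
    heap = [(-quality, pattern) for pattern, quality in phi.items()]
    heapq.heapify(heap)
    loss = sum(phi.values())  # loss of the empty prefix
    order, losses = [], []
    while heap:
        negated_quality, pattern = heapq.heappop(heap)
        loss += negated_quality  # this pattern's error drops to zero
        order.append(pattern)
        losses.append(loss)
    return order, losses
```

For the hypothetical input `{"a": 5, "b": 3, "c": 1}` the ordering is a, b, c with prefix losses 4, 1 and 0. Each pop costs logarithmic time, which gives the $O(|\mathcal{P}| \log |\mathcal{P}|)$ total stated in the example.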

The patterns could be ordered also by starting with the whole 
pattern collection V and repeatedly removing from the collection 
the pattern whose omission increases the error least, rather than 
starting with an empty collection and adding the pattern that de- 
creases the error most. 

If the pattern ordering and the errors for all of its prefixes are
computed (as Algorithm 4.1 does), then the user can very efficiently
explore the trade-offs between the size and the accuracy: If the
number of patterns is overwhelming, then the user can consider
shorter prefixes of the pattern ordering. If the accuracy of the
estimates is not high enough, then the user can add more patterns
to the prefix.

Furthermore, this exploration can be done very efficiently. Finding
the prefix of length $k$ can always be implemented to run in
constant time by representing the pattern ordering as an array of
patterns. The shortest prefix with error at most a given threshold
$\epsilon$ can be found in time $O(|\mathcal{P}|)$ by scanning the array of patterns
sequentially. Similarly, the prefix of length at most $k$ with the smallest
error can be found in time linear in $|\mathcal{P}|$. If the loss function is
nonincreasing, i.e., it is such that

$$\ell(\phi, \psi(\cdot, \phi|_{\mathcal{S}})) \leq \ell(\phi, \psi(\cdot, \phi|_{\mathcal{S} \setminus \{p\}}))$$

for each $p \in \mathcal{S}$ and each $\mathcal{S} \subseteq \mathcal{P}$, then the time consumption of these
tasks can be reduced to $O(\log |\mathcal{P}|)$ by a simple application of binary
search.
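
The binary search mentioned above can be sketched as follows, assuming the prefix losses are stored in a list `losses` with `losses[i]` being the loss of the prefix of length `i + 1`, and assuming that this list is nonincreasing. The function name is hypothetical.

```python
def shortest_prefix_within(losses, eps):
    """Smallest prefix length whose loss is at most eps, or None.

    Assumes `losses` is nonincreasing, so the prefixes satisfying the
    bound form a suffix of the list and binary search applies.
    """
    lo, hi = 0, len(losses)
    while lo < hi:
        mid = (lo + hi) // 2
        if losses[mid] <= eps:
            hi = mid  # prefix mid + 1 is good enough; try shorter ones
        else:
            lo = mid + 1  # prefix mid + 1 is too inaccurate
    return lo + 1 if lo < len(losses) else None
```

For example, with the hypothetical losses `[24, 12, 4, 3, 1, 0, 0]` and the threshold 4 the function returns 3. If the losses are not nonincreasing, the sequential scan described in the text should be used instead.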






In addition to efficient exploration of trade-offs between the size
and the accuracy, the pattern ordering can shed some light on the
relationships between the patterns in the collections. For example,
the prefixes of the pattern ordering suggest which patterns are
complementary to each other and show which improve the quality
value estimation.



4.2 Approximating the Best k-Subcollection of
Patterns

On one hand, the problem of finding the best $k$-subcollection of
patterns is NP-hard as shown by Theorem 4.1. Thus, there is not
much hope for polynomial-time algorithms for finding the best
$k$-subcollection in general. On the other hand, the optimal pattern
ordering can be found by Algorithm 4.1. Furthermore, the greedy
procedure (of which Algorithm 4.1 is one example) has been recognized
to provide efficiently exact or approximate solutions for a
wide variety of other problems [Fei98, GK99, HMS93, KKT03].
Actually, Algorithm 4.1 provides the optimal solution for some special
cases. For example, the prefixes of the optimal pattern ordering
for Example 4.6 are also the best subcollections. Furthermore, the
optimal pattern ordering always determines the best pattern to describe
the quality values of the whole collection. Unfortunately, it
does not necessarily provide the optimal solution for an arbitrary
value of $k$.

Example 4.7 (the suboptimality of the optimal pattern ordering).
Let the pattern collection $\mathcal{P}$ be $2^{\{A,B,C\}}$ and the interestingness
measure be the support. Let the support of the itemset
$\{A,B,C\}$ be 1 and the other supports be 3. Furthermore, let the
estimation method be as defined by Equation 4.2 and let the loss
function be the Euclidean distance, i.e., Equation 4.3 with $p = 2$.

Then the initial loss is 55. The best 3-subcollection consists
of itemsets $\{A,B\}$, $\{A,C\}$ and $\{B,C\}$ with the loss 1 whereas
Algorithm 4.1 chooses the itemset $\{A,B,C\}$ instead of one of the
2-itemsets, resulting in the loss 4. The decreases of losses are 54 and
51, respectively. □
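
The numbers of Example 4.7 can be checked mechanically. The sketch below rests on assumptions that are not stated explicitly in the example but reproduce its figures: the loss is taken to be the sum of squared errors (the squared Euclidean distance), only the non-empty itemsets are counted, and an itemset with no chosen superitemset is estimated to have support 0.

```python
from itertools import combinations

ITEMS = "ABC"
# The seven non-empty subsets of {A, B, C}.
PATTERNS = [frozenset(c) for r in range(1, 4) for c in combinations(ITEMS, r)]
SUPPORT = {p: (1 if len(p) == 3 else 3) for p in PATTERNS}

def estimate_max(p, chosen):
    """Largest support among the chosen superitemsets of p (0 if none)."""
    return max((SUPPORT[q] for q in chosen if p <= q), default=0)

def loss(chosen):
    """Sum of squared estimation errors over all patterns."""
    return sum((SUPPORT[p] - estimate_max(p, chosen)) ** 2 for p in PATTERNS)

def greedy_prefix(k):
    """First k patterns of the greedy ordering (ties: first in PATTERNS)."""
    prefix = []
    for _ in range(k):
        candidates = [p for p in PATTERNS if p not in prefix]
        prefix.append(min(candidates, key=lambda p: loss(prefix + [p])))
    return prefix

def best_subcollection(k):
    """Exhaustive search for the k-subcollection with the smallest loss."""
    return min(combinations(PATTERNS, k), key=loss)
```

Under these assumptions `loss([])` is 55, the greedy 3-prefix starts with {A, B, C} and has loss 4, and the exhaustive search returns the three 2-itemsets with loss 1, in agreement with the example.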
There is still a possibility, however, that the optimal pattern
ordering provides reasonable approximations for at least some
$k$-subcollections, loss functions and estimation methods.

In fact, under certain assumptions about the estimation method $\psi$
and the loss function $\ell$, it is possible to show that each $k$-prefix
of the pattern ordering is within a constant factor from the corresponding
best $k$-subcollection of patterns in $\mathcal{P}$ for all $k = 1, \ldots, |\mathcal{P}|$
simultaneously.

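More specifically, if the estimation method $\psi$ and the loss function
$\ell$ together satisfy certain conditions, then for each $k$-prefix $\mathcal{P}_k$
of the pattern ordering the decrease of loss

$$\ell(\phi, \psi(\cdot, \phi|_{\emptyset})) - \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_k}))$$

is within the factor $1 - 1/e > 0.6321$ from the maximum decrease
of loss

$$\ell(\phi, \psi(\cdot, \phi|_{\emptyset})) - \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}^*_k}))$$

for any $k$-subcollection of $\mathcal{P}$, i.e.,

$$\frac{\ell(\phi, \psi(\cdot, \phi|_{\emptyset})) - \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_k}))}{\ell(\phi, \psi(\cdot, \phi|_{\emptyset})) - \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}^*_k}))} \geq \frac{e - 1}{e}$$

for all $k \in \{1, \ldots, |\mathcal{P}|\}$, where $\mathcal{P}^*_k$ denotes the best $k$-subcollection of $\mathcal{P}$.

To simplify the notation, we use the following shorthands:

$$\delta_i = \ell(\phi, \psi(\cdot, \phi|_{\emptyset})) - \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}_i}))$$
$$\delta^*_i = \ell(\phi, \psi(\cdot, \phi|_{\emptyset})) - \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}^*_i}))$$

The pattern collection $\mathcal{P}$, the interestingness measure $\phi$, the estimation
method $\psi$ and the loss function $\ell$ are assumed to be clear
from the context.

First we show that if the loss decreases sufficiently from the $(i-1)$-prefix
to the $i$-prefix for all $i = 1, \ldots, |\mathcal{P}|$, then $\delta_k \geq (1 - 1/e)\,\delta^*_k$
holds for all $k = 1, \ldots, |\mathcal{P}|$.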



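Lemma 4.1. If

$$\delta_i - \delta_{i-1} \geq \frac{1}{k} \left( \delta^*_k - \delta_{i-1} \right) \qquad (4.5)$$

holds for all $i$ and $k$ with $1 \leq i \leq k \leq |\mathcal{P}|$ then

$$\delta_k \geq \left( 1 - \frac{1}{e} \right) \delta^*_k$$

for all $k = 1, \ldots, |\mathcal{P}|$.

Proof. From Equation 4.5 we get

$$\delta_i \geq \frac{1}{k} \delta^*_k + \left( 1 - \frac{1}{k} \right) \delta_{i-1} \geq \frac{1}{k} \delta^*_k \sum_{j=0}^{i-1} \left( 1 - \frac{1}{k} \right)^j = \left( 1 - \left( 1 - \frac{1}{k} \right)^i \right) \delta^*_k$$

since by definition $\delta_0 = \epsilon_0 - \epsilon_0 = 0$.
Thus,

$$\delta_k \geq \left( 1 - \left( 1 - \frac{1}{k} \right)^k \right) \delta^*_k \geq \left( 1 - \frac{1}{e} \right) \delta^*_k$$

as claimed. □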
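
The bound of Lemma 4.1 can be sanity-checked numerically. The sketch below assumes the worst case in which Equation 4.5 holds with equality at every step and normalizes $\delta^*_k = 1$; the function name is hypothetical.

```python
import math

def worst_case_ratio(k):
    """delta_k / delta*_k when Equation 4.5 holds with equality.

    With delta*_k normalized to 1, each step adds exactly a 1/k-fraction
    of the remaining gap 1 - delta_{i-1}.
    """
    delta = 0.0  # delta_0 = 0
    for _ in range(k):
        delta += (1.0 - delta) / k
    return delta
```

The value agrees with the closed form $1 - (1 - 1/k)^k$ and decreases towards $1 - 1/e \approx 0.6321$ as $k$ grows; for instance, `worst_case_ratio(1)` is 1.0 and `worst_case_ratio(2)` is 0.75.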

The approximation with respect to the optimal loss is not so
easy. In fact, the optimal pattern ordering does not provide any
approximation ratio guarantees in general: there might be a collection
$\mathcal{P}^*_k$ of $k$ patterns that provides a zero-loss estimation of $\phi$ but
still the $k$-subcollection chosen by Algorithm 4.1 can have non-zero
loss. (Note that also in Example 4.7 the ratio of losses is 4
whereas the ratio between the decreases of losses is 17/18.) Still,
we can transform Lemma 4.1 to give bounds for the loss instead of
the decrease of the loss.

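Lemma 4.2. If

$$\epsilon_{i-1} - \epsilon_i \geq \frac{1}{k} \left( \epsilon_{i-1} - \epsilon^*_k \right)$$

for all $i$ and $k$ with $1 \leq i \leq k \leq |\mathcal{P}|$ then also

$$\epsilon_k \leq \left( 1 - \frac{1}{e} \right) \epsilon^*_k + \frac{1}{e} \epsilon_0$$

holds for all $k = 1, \ldots, |\mathcal{P}|$, where $\epsilon^*_k = \ell(\phi, \psi(\cdot, \phi|_{\mathcal{P}^*_k}))$.

Proof. First note that

$$\epsilon_{i-1} - \epsilon_i \geq \frac{1}{k} \left( \epsilon_{i-1} - \epsilon^*_k \right) \iff \delta_i - \delta_{i-1} \geq \frac{1}{k} \left( \delta^*_k - \delta_{i-1} \right)$$

and second that

$$\epsilon_k \leq \left( 1 - \frac{1}{e} \right) \epsilon^*_k + \frac{1}{e} \epsilon_0 \iff \delta_k \geq \left( 1 - \frac{1}{e} \right) \delta^*_k.$$

Thus, Lemma 4.1 gives the claimed result. □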



The bound given by Lemma 4.2 is considerably weaker than the
bound given by Lemma 4.1 due to the additive term of a constant
fraction of the initial error, i.e., the error of our initial assumption
about the quality values.

Still, the prefixes of the optimal pattern ordering serve as good 
representative /c-subcollections of V for all values of k simultane- 
ously, in addition to being a refining description of the quality values 
of the pattern collection. 

4.3 Approximating the Quality Values 

As a more concrete illustration of the approximation abilities of the
pattern orderings, in this section we shall consider the orderings
of patterns in downward closed collections $\mathcal{P}$ with anti-monotone
interestingness measures when the quality value of a pattern is estimated
to be the maximum of the quality values of its known superpatterns
(the collection $\mathcal{S}$), i.e.,

$$\psi_{Max}(p, \phi|_{\mathcal{S}}) = \max\{ \phi(p') : p \preceq p' \in \mathcal{S} \}. \qquad (4.6)$$

Note that this estimation method was used also in Example 4.3.
The next results show that the estimation method $\psi_{Max}$ gives the
correct quality values for all patterns in $\mathcal{P}$ exactly when the subcollection
used in the estimation contains all closed patterns in the
collection $\mathcal{P}$.

Theorem 4.3. The collection $Cl(\mathcal{P})$ of the closed patterns in $\mathcal{P}$ is
the smallest subcollection of $\mathcal{P}$ such that

$$\phi(p) = \psi_{Max}(p, \phi|_{Cl(\mathcal{P})})$$

for all $p \in \mathcal{P}$.
Proof. By definition, for each pattern $p \in \mathcal{P}$ there is a pattern
$p' \in Cl(\mathcal{P})$ such that $p \preceq p'$ and $\phi(p) = \phi(p')$. As we assume
that the interestingness measure $\phi$ is anti-monotone, taking the
maximum quality value of the superpatterns $p' \in Cl(\mathcal{P})$ of a pattern
$p \in \mathcal{P}$ determines the quality value of $p$ correctly. Thus, $\phi|_{Cl(\mathcal{P})}$ is
sufficient to determine $\phi|_{\mathcal{P}}$.

To see that all closed patterns in $\mathcal{P}$ are needed, notice that the
quality value of a pattern $p \in Cl(\mathcal{P})$ is greater than any of the
quality values of its proper superpatterns. Thus, the whole restriction
$\phi|_{Cl(\mathcal{P})}$ is needed to determine even
$\phi|_{Cl(\mathcal{P})}$ using the estimation method $\psi_{Max}$. □
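
Theorem 4.3 can be illustrated for itemsets with a few lines of Python. In this sketch (with hypothetical function names and toy supports) a pattern is closed when no proper superset in the collection has the same quality value, and an itemset without a chosen superset is estimated as 0.

```python
def closed_patterns(phi):
    """Patterns with no proper superpattern of equal quality value."""
    return {
        p for p in phi
        if not any(p < q and phi[q] == phi[p] for q in phi)
    }

def estimate_max(p, phi, chosen):
    """Largest quality value among the chosen superpatterns of p."""
    return max((phi[q] for q in chosen if p <= q), default=0)

# Hypothetical supports: {B} is not closed because {A, B} has the same
# support.
phi = {
    frozenset("A"): 3,
    frozenset("B"): 2,
    frozenset("AB"): 2,
}
```

On this input the closed patterns are {A} and {A, B}, and estimating every pattern from them recovers all three supports exactly. Dropping either closed pattern makes at least one estimate wrong.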

Thus, the problem of finding the smallest subcollection $\mathcal{S}$ of $\mathcal{P}$
such that $\ell(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S}})) = 0$ with respect to the estimation
method $\psi_{Max}$ and any reasonable loss function $\ell$ (i.e., a loss function
such that only the correct estimation of $\phi|_{\mathcal{P}}$ has zero loss and such
that the loss can be evaluated efficiently for any $p \in \mathcal{P}$) can be
solved efficiently by Algorithm 2.3.

If some error is allowed, then the complexity of the solution depends
also on the loss function. Let us first consider the maximum
absolute error

$$\ell_{Max}(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S}})) = \max_{p \in \mathcal{P}} \{ \phi(p) - \psi_{Max}(p, \phi|_{\mathcal{S}}) \},$$

i.e., the loss defined by Equation 4.3 with $p = \infty$.

By Theorem 4.1 the problem of finding the best $k$-subcollection
of $\mathcal{P}$ with the loss at most $\epsilon$ is NP-hard even when $\mathcal{P} = \mathcal{F}(\sigma,\mathcal{D})$ and
$\phi = fr$. The maximum absolute error is not a very informative loss
function since it does not take into account the number of patterns
with error exceeding the maximum error threshold $\epsilon$. Still, it makes
a difference whether there is one or one million patterns exceeding
the maximum absolute difference threshold.

If the loss function is the number of frequencies that are not
estimated with the absolute error at most $\epsilon$, i.e.,

$$\ell_{Max,\epsilon}(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S}})) = |\{ p \in \mathcal{P} : |\phi(p) - \psi_{Max}(p, \phi|_{\mathcal{S}})| > \epsilon \}| \qquad (4.7)$$

then the problem can be modeled as a special case of the maximum
$k$-coverage problem (Problem 4.4).
Problem 4.4 (maximum $k$-coverage [ACK+99]). Given a collection
$\mathcal{C}$ of subsets of a finite set $S$ and a positive integer $k$, find
a $k$-subcollection $\mathcal{C}'$ of $\mathcal{C}$ with the largest coverage of $S$, i.e., the
collection $\mathcal{C}' \subseteq \mathcal{C}$ of cardinality $k$ that maximizes the cardinality
of the union $\bigcup_{X \in \mathcal{C}'} X$.

Theorem 4.4. Let the estimation method be $\psi_{Max}$ (Equation 4.6)
and the loss function be $\ell_{Max,\epsilon}$ (Equation 4.7). Then Problem 4.1
is a special case of Problem 4.4.

Proof. The reduction from an instance $(\mathcal{P}, \phi, \psi_{Max}, \ell_{Max,\epsilon}, k)$ of Problem 4.1
to an instance $(\mathcal{C}, S, k)$ of Problem 4.4 is straightforward.
The set $S$ consists of all patterns in $\mathcal{P}$, and $\mathcal{C}$ consists of the sets
$\{ p' \in \mathcal{P} : p' \preceq p,\ \phi(p') - \phi(p) \leq \epsilon \}$ for each $p \in \mathcal{P}$. □
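
The reduction in the proof of Theorem 4.4 can be sketched for itemsets as follows. The greedy selection used here is the standard heuristic for maximum $k$-coverage and is shown only as an illustration; the function names and the toy supports are assumptions, and `k` must not exceed the number of patterns.

```python
def covered_by(p, phi, eps):
    """Patterns that p estimates with absolute error at most eps."""
    return {q for q in phi if q <= p and phi[q] - phi[p] <= eps}

def greedy_coverage(phi, eps, k):
    """Greedily choose k patterns that cover as many patterns as possible.

    Returns the chosen patterns and the number of uncovered patterns,
    i.e., the loss of Equation 4.7 for the chosen subcollection.
    """
    covered, chosen = set(), []
    for _ in range(k):
        best = max(
            (p for p in phi if p not in chosen),
            key=lambda p: len(covered_by(p, phi, eps) - covered),
        )
        chosen.append(best)
        covered |= covered_by(best, phi, eps)
    return chosen, len(phi) - len(covered)

# Hypothetical integer supports, used with eps = 1.
phi = {
    frozenset("A"): 6,
    frozenset("B"): 5,
    frozenset("AB"): 4,
    frozenset("C"): 3,
}
```

With `eps = 1` the itemset {A, B} covers itself and {B}, so a single chosen pattern leaves two patterns with error above the threshold, and two chosen patterns leave one.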

If the sum of errors is used instead of the maximum absolute
error, the following approximation bounds can be guaranteed:

Theorem 4.5. For the length-$k$ prefix $\mathcal{P}_k$ of the optimal solution
for the pattern ordering problem of the pattern collection $\mathcal{P}$ and the
best $k$-subcollection $\mathcal{P}^*_k$ of $\mathcal{P}$, we have

$$\delta_k \geq \left( 1 - \frac{1}{e} \right) \delta^*_k$$

for the estimation method $\psi_{Max}$ (Equation 4.6) and for any loss
function

$$\ell_f(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S}})) = \sum_{p \in \mathcal{P}} f(\phi(p) - \psi_{Max}(p, \phi|_{\mathcal{S}})) \qquad (4.8)$$

where $f$ is an increasing function.

Proof. Using Lemma 4.1 it is sufficient to show that Equation 4.5
holds.

Let $p_1, \ldots, p_{|\mathcal{P}|}$ be the ordering of the patterns in $\mathcal{P}$ given by
Algorithm 4.1 and let $\mathcal{P}_i = \{ p_1, \ldots, p_i \}$. The pattern collection $\mathcal{P}$
can be partitioned into $k$ groups $\mathcal{P}_p$, $p \in \mathcal{P}_k^*$, as follows:

'Pp^ = ^p^Pi:i = min |j G {1, . . . , \V\} : (p{pj) = ^Max{p, (AIp*)}} • 
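Note that

$$\ell_f(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S}})) \ge \ell_f(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S} \cup \mathcal{S}'}))$$

for all $\mathcal{S}, \mathcal{S}' \subseteq \mathcal{P}$. This implies that also

$$\ell_f(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{P}_i})) \ge \ell_f(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{P}_i \cup \mathcal{P}_k^*}))$$

for all $i, k \in \{ 1, \ldots, |\mathcal{P}| \}$.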

For any $i, k \in \{ 1, \ldots, |\mathcal{P}| \}$, the decrease of loss $\delta_k^* - \delta_{i-1}$ can be
written as

"^finp) - i^MaxiP,4>\r,-iUV*)) ) - Si^l- 
and it can be further partitioned into sums 

^ ifiHp') - '^MaxiP, 4>\v^-iU{p})) - fiHp') - i^MaxiP, 4>\p,-i))) 

for each $p \in \mathcal{P}_k^*$. At least one of those sums must be at least a
1/k-fraction of $\delta_k^* - \delta_{i-1}$. Thus, the claim holds. □

Furthermore, the search for the optimal pattern ordering using
the estimation method $\psi_{Max}$ (Equation 4.6) can be sped up by
considering, without loss of generality, only the closed patterns:

Theorem 4.6. For all loss functions $\ell$ and all subcollections $\mathcal{S}$ of
the pattern collection $\mathcal{P}$ we have

$$\ell(\phi, \psi_{Max}(\cdot, \phi|_{\mathcal{S}})) \ge \ell(\phi, \psi_{Max}(\cdot, \phi|_{\{ cl(p) : p \in \mathcal{S} \}})).$$

Proof. Any pattern $p \in \mathcal{P}$ can be replaced by its closure $cl(p)$
since $\phi(p) = \phi(cl(p))$. Furthermore, if $p' \preceq p$ then $p' \preceq cl(p)$ for all
$p, p' \in \mathcal{P}$. □
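The closure argument of the proof can be checked on a toy example. The sketch below implements the maximum-based estimate for itemsets (the largest known frequency among the known supersets, and 0 if there is none) and compares the estimates obtained from a collection of known itemsets with those obtained from their closures. The database and all function names are hypothetical and serve only as an illustration.

```python
from itertools import combinations

# Hypothetical toy transaction database (not from the text).
DB = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]
ITEMS = sorted(set().union(*DB))
PATTERNS = [frozenset(c) for r in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]

def freq(x):
    """Fraction of transactions that contain the itemset x."""
    return sum(x <= t for t in DB) / len(DB)

def closure(x):
    """Intersection of all transactions containing x (x itself if none does)."""
    cover = [t for t in DB if x <= t]
    return frozenset.intersection(*cover) if cover else x

def psi_max(x, known):
    """Maximum frequency among the known supersets of x; 0 if there is none."""
    return max((freq(y) for y in known if x <= y), default=0.0)

S = [frozenset("A"), frozenset("AC")]
closed_S = [closure(p) for p in S]
```

Here the closure of {A} is {A, B}, which has the same frequency as {A} but also yields the correct estimate for {B} and {A, B}; no estimate becomes worse when the known itemsets are replaced by their closures.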

Example 4.8 (ordering 0.20-frequent itemsets in the course
completion database). Let us examine how the estimation method
$\psi_{Max}$ orders the 0.20-frequent itemsets in the course completion
database (see Subsection 2.2.1) when the loss function is the average
of the absolute differences.

The averages of the absolute differences in the frequency
estimates for all prefixes up to length 2000 of the pattern ordering are
shown in Figure 4.1. It can be noticed that the error decreases quite
quickly.




Figure 4.1: The average of the absolute differences between the
frequencies estimated by Equation 4.6 and the correct frequencies
for each prefix of the pattern ordering.



The decrease of the error does not tell much about the other
properties of the pattern ordering. As the estimation method
takes the maximum of the frequencies of the known superitemsets,
it is natural to ask whether the first itemsets in the ordering
are maximal. Recall that the number of maximal 0.20-frequent
itemsets is 253 and the number of closed 0.20-frequent itemsets is
2136. The first 46 itemsets in the ordering are maximal, but after
that there are also itemsets that are not maximal in the collection
of 0.20-frequent itemsets in the course completion database. The
last maximal itemset appears as the 300th itemset in the ordering.
The interesting part of the ratios between the non-maximal and the
maximal itemsets in the prefixes of the ordering is illustrated in
Figure 4.2. (Note that after the 300th itemset in the ordering, the
ratio changes by an additive term 1/253 for each itemset.)

One explanation for the relatively large number of maximal itemsets
in the beginning of the ordering is that the initial estimate for
all frequencies is 0, whereas the frequency of each maximal itemset
is in our case at least 0.20. Furthermore, the first maximal itemsets
are quite large (thus containing a large number of at least
0.20-frequent itemsets) and the majority of the frequencies of the
0.20-frequent itemsets are within 0.025 of 0.20. Still, the itemset
ordering differs considerably from first listing all maximal
0.20-frequent itemsets and then all other 0.20-frequent itemsets.

Figure 4.2: The ratio between the non-maximal and the maximal
itemsets in the prefixes of the pattern ordering. (x-axis: the length
of the prefix of the ordering.)

For a more refined view of the ordering, the average cardinality
of the itemsets in each prefix is shown in Figure 4.3 and the number
of itemsets of each cardinality in each prefix is shown in Figure 4.4.

Figure 4.3: The average cardinality of the itemsets.

Figure 4.4: The number of itemsets of each cardinality.

The average cardinality of the itemsets in the prefixes drops quite
quickly close to the global average cardinality. That is, after the
initial major corrections in the frequencies (i.e., listing some of the
largest maximal itemsets) there are both small and large itemsets
in the ordering. Furthermore, the itemsets of all cardinalities are
listed at roughly the same relative speed. Thus, based on these
statistics, the method seems to provide some added value compared
to listing the itemsets levelwise from the largest to the smallest
cardinality as well as listing the itemsets in the order of increasing
frequency. That is, the method seems to list one itemset here and
another there, giving a refining view of the itemset collection, as
hoped. □



4.4 Tiling Databases 

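In this section we illustrate the use of pattern orderings as refining
descriptions of data, transaction databases in particular.

A transaction database $\mathcal{D}$ can be seen as an $n \times m$ binary matrix
$M_{\mathcal{D}}$ such that

$$M_{\mathcal{D}}[i, A] = \begin{cases} 1 & \text{if } A \in X \text{ for some } (i, X) \in \mathcal{D} \\ 0 & \text{otherwise.} \end{cases}$$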



Viewing transaction databases as binary matrices suggests also
pattern classes and interestingness measures different from itemsets
and frequencies.

For example, it is not clear why itemsets (i.e., sets of column
indices) would be especially suitable for describing binary matrices.
Instead of sets of column indices, it could be more natural to
describe the matrices by their monochromatic submatrices.
Furthermore, as the transaction databases are often sparse, we shall
focus on submatrices full of ones, i.e., tiles [GGM04], also known
as bi-sets [BRB04], and closely related to formal concepts [GW99].
(Some other approaches that also take the transaction identifiers into
account to choose a representative collection of itemsets are
described in [TKR+95, WK04].) As a quality measure we shall
consider the areas of the tiles.

Definition 4.1 (tiles, tilings and their area). Let $\mathcal{D}$ be a
transaction database over $\mathcal{I}$.

A tile $\tau(C, X)$ is a set $C \times X$ such that $C \subseteq tid(\mathcal{D})$ and $X \subseteq \mathcal{I}$.
The sets $C$ and $X$ can be omitted when they are not of importance.
A tile $\tau(C, X)$ is contained in $\mathcal{D}$ if for each $(i, A) \in \tau(C, X)$ there is
a transaction $(i, Y) \in \mathcal{D}$ such that $A \in Y$ (and thus $X \subseteq Y$, too).

A tile $\tau(C, X)$ is maximal in $\mathcal{D}$ if it is contained in $\mathcal{D}$ and none
of its supertiles is contained in $\mathcal{D}$, i.e., a tile $\tau(C, X)$ is maximal in
$\mathcal{D}$ if $\tau(C, X) = \tau(cover(X, \mathcal{D}), cl(X, \mathcal{D}))$.

The area of a tile $\tau(C, X)$ is

$$area(\tau(C, X)) = |\tau(C, X)| = |C| \, |X|.$$

The area of an itemset $X$ in $\mathcal{D}$ is the same as the area of the tile
$\tau(cover(X, \mathcal{D}), X)$.

A tiling $\mathcal{T}$ is a collection of tiles. A tiling is contained in $\mathcal{D}$ if all
tiles in the tiling are contained in $\mathcal{D}$.

The area of a tiling $\mathcal{T}$ is

$$area(\mathcal{T}) = \left| \bigcup_{\tau(C, X) \in \mathcal{T}} \tau(C, X) \right|.$$

The closure of a tiling $\mathcal{T}$ in $\mathcal{D}$ is

$$cl(\mathcal{T}, \mathcal{D}) = \{ \tau(cover(X, \mathcal{D}), cl(X, \mathcal{D})) : \tau(C, X) \in \mathcal{T} \}.$$

The motivation to consider the area of tilings is the following. As
the transaction database $\mathcal{D}$ is typically sparse, it might be a good
idea to describe $\mathcal{D}$ by indicating where the ones are in the binary
matrix by the row and the column indices of submatrices full of
ones. Thus, the quality of a tile or a tiling is measured by the
number of ones covered by it, whereas the goal in frequent itemset
mining is to find tiles that are as high as possible.

Tiles and tilings are most suitable for pattern ordering since the
area of a tiling determines a natural loss function for tilings.

More specifically, the task of tiling transaction databases can be
formulated as a special case of Problem 4.3. The pattern collection
$\mathcal{P}$ is the collection of tiles in $\mathcal{D}$. The interestingness measure $\phi$ is
the area of the tile. The estimation method does not depend on
the areas of the known tiles but just on the tiles. The known tiles are
sufficient to determine the areas of all their subtiles. (Similarly,
in the case of frequent itemsets, a subcollection of frequent itemsets
would be sufficient to determine the cardinalities of those frequent
itemsets and their subitemsets.) A loss function $\ell$ can be, e.g., the
number of ones not yet covered, i.e.,

$$\ell_{area}(\mathcal{D}, \mathcal{T}) = \left| \{ (i, A) : (i, X) \in \mathcal{D}, A \in X \} \right| - area(\mathcal{T}).$$

Thus, the number of transaction databases with $|\mathcal{D}|$ transactions
over $\mathcal{I}$ that are compatible with the tiling $\mathcal{T}$ is at most

$$\binom{|\mathcal{D}| \, |\mathcal{I}| - area(\mathcal{T})}{\left| \{ (i, A) : (i, X) \in \mathcal{D}, A \in X \} \right| - area(\mathcal{T})}. \qquad (4.9)$$
Note that we use the transaction database $\mathcal{D}$ and the tiling $\mathcal{T}$
as the parameters of the loss functions, instead of the area function
for tiles in $\mathcal{D}$ and the area function for tiles in $\mathcal{T}$, since they carry
the same information as the area function for all tiles in $\mathcal{D}$ and the
area function for the tiles in the tiling $\mathcal{T}$.
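The notions of tile, maximal tile, tiling area and the area-based loss can be illustrated with a short sketch. Tiles are represented as sets of (transaction identifier, item) pairs, the area of a tiling is the number of distinct pairs covered by its tiles, and the loss is the number of ones not yet covered. The toy database and all function names are assumptions made for this illustration only.

```python
# Hypothetical toy transaction database: transaction identifier -> itemset.
DB = {1: frozenset("AB"), 2: frozenset("ABC"), 3: frozenset("BC")}

def cover(x):
    """Identifiers of the transactions that contain the itemset x."""
    return frozenset(i for i, t in DB.items() if x <= t)

def closure(x):
    """Intersection of the transactions containing x (x itself if none does)."""
    tids = cover(x)
    return frozenset.intersection(*(DB[i] for i in tids)) if tids else x

def tile(c, x):
    """The tile C x X as a set of (transaction identifier, item) pairs."""
    return frozenset((i, a) for i in c for a in x)

def maximal_tile(x):
    """The maximal tile induced by x: its cover times its closure."""
    return tile(cover(x), closure(x))

def tiling_area(tiling):
    """Number of distinct cells covered by the tiles of the tiling."""
    return len(frozenset().union(*tiling))

def loss_area(tiling):
    """Number of ones in the database not covered by the tiling."""
    ones = sum(len(t) for t in DB.values())
    return ones - tiling_area(tiling)

tiling = [maximal_tile(frozenset("A")), maximal_tile(frozenset("C"))]
```

In this toy database the two maximal tiles overlap in the cell (2, B), so the area of the tiling is 7 rather than 4 + 4 = 8, and the tiling covers all ones of the database.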

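Proposition 4.1. Each k-prefix of the best ordering of tiles in $\mathcal{D}$
with respect to the loss function $\ell_{area}$ defines a tiling $\mathcal{T}_k$ that has
area within a factor $1 - 1/e$ from the best k-tiling $\mathcal{T}_k^*$ in $\mathcal{D}$.

Proof. Based on Lemma 4.1 it is sufficient to show that
Equation 4.5 holds, i.e., that

$$area(\mathcal{T}_i) - area(\mathcal{T}_{i-1}) \ge \frac{1}{k} \left( area(\mathcal{T}_k^*) - area(\mathcal{T}_{i-1}) \right)$$

for all $i$ and $k$ such that $1 \le i \le k$.

Let $\mathcal{T}_k^* = \{ \tau_{j_1}, \ldots, \tau_{j_k} \}$. There must be a $\tau^* \in \mathcal{T}_k^*$ such that

$$area(\mathcal{T}_{i-1} \cup \{ \tau^* \}) - area(\mathcal{T}_{i-1}) \ge \frac{1}{k} \left( area(\mathcal{T}_k^*) - area(\mathcal{T}_{i-1}) \right)$$

since

$$area(\mathcal{T}_k^*) \le area(\mathcal{T}_{i-1} \cup \mathcal{T}_k^*).$$

Thus, there is a tile $\tau_i$ in $\mathcal{D}$ but not in $\mathcal{T}_{i-1}$ such that
$area(\mathcal{T}_{i-1} \cup \{ \tau_i \}) \ge area(\mathcal{T}_{i-1} \cup \{ \tau^* \})$, i.e., the claim holds. □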

Hence, each prefix of the ordering of tiles in $\mathcal{D}$ gives a good
approximation for the best tiling of the same cardinality. All we
have to do is to find the tiles in $\mathcal{D}$.

The first obstacle for finding the tiles in $\mathcal{D}$ is that the number
of tiles can be very large. A slight relief is that we can restrict our
focus to maximal tiles in $\mathcal{D}$ instead of all tiles.

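Proposition 4.2. Replacing the tiles $\tau(C, X)$ of a tiling $\mathcal{T}$ in $\mathcal{D}$ by
the maximal tiles $\tau(cover(X, \mathcal{D}), cl(X, \mathcal{D}))$ in $\mathcal{D}$ does not decrease
the area of the tiling.

Proof. This is immediate since $\tau(C, X) \subseteq \tau(cover(X, \mathcal{D}), cl(X, \mathcal{D}))$
for each tile $\tau(C, X) \in \mathcal{T}$ contained in $\mathcal{D}$, and thus

$$\bigcup_{\tau(C, X) \in \mathcal{T}} \tau(C, X) \subseteq \bigcup_{\tau(C, X) \in \mathcal{T}} \tau(cover(X, \mathcal{D}), cl(X, \mathcal{D}))$$

which implies that $area(\mathcal{T}) \le area(cl(\mathcal{T}, \mathcal{D}))$. □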
The number of maximal tiles in $\mathcal{D}$ could still be prohibitive.
The number of maximal tiles to be considered can be decreased by
finding only large maximal tiles, i.e., the maximal tiles with area at
least some threshold. Unfortunately, even the problem of finding
the largest tile is NP-hard [GGM04, Pee03], but there are methods
that can find the large maximal tiles in practice [BRB04, GGM04].

Nevertheless, ordering the large maximal tiles is not the same
as ordering all maximal tiles. Although the number of all maximal
tiles might be prohibitive, it would be possible to construct any prefix
of the optimal ordering of the maximal tiles in $\mathcal{D}$ if we could find,
for any prefix $\mathcal{T}_i$ of the ordering, the tile $\tau_{i+1}$ in $\mathcal{D}$ that maximizes
$area(\mathcal{T}_i \cup \{ \tau_{i+1} \})$. Clearly, this problem is also NP-hard, but in
practice such tiles can be found reasonably efficiently [GGM04].
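The incremental construction of a prefix of the ordering can be sketched as a greedy loop that repeatedly adds the tile covering the largest number of still uncovered ones. The sketch below finds that tile by brute-force enumeration of all maximal tiles of a tiny database; this exhaustive search is an assumption made for illustration (it is exponential in the number of items) and is not the search method of [GGM04]. The database and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transaction database: transaction identifier -> itemset.
DB = {1: frozenset("ABC"), 2: frozenset("ABC"), 3: frozenset("AB"),
      4: frozenset("D")}
ITEMS = sorted(set().union(*DB.values()))

def maximal_tiles():
    """Enumerate all maximal tiles by brute force over the non-empty itemsets."""
    tiles = set()
    for r in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, r):
            x = frozenset(combo)
            tids = [i for i, t in DB.items() if x <= t]
            if tids:
                closed = frozenset.intersection(*(DB[i] for i in tids))
                tiles.add(frozenset((i, a) for i in tids for a in closed))
    return tiles

def greedy_tiling(k):
    """Greedily pick at most k tiles, each maximizing the newly covered area."""
    candidates = maximal_tiles()
    order, covered = [], set()
    for _ in range(k):
        # Ties are broken deterministically by the sorted cell list.
        best = max(candidates, key=lambda t: (len(t - covered), sorted(t)))
        if not best - covered:
            break  # no tile adds new ones
        order.append(best)
        covered |= best
    return order, len(covered)
```

For this database there are three maximal tiles with areas 6, 6 and 1; the greedy prefixes of length 1, 2 and 3 cover 6, 8 and all 9 ones, respectively.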

Example 4.9 (Tiling the course completion database). Let
us consider the course completion database (see Subsection 2.2.1).

We computed the greedy tilings using Algorithm 4.1 and an
algorithm for discovering the tile that improves the current tiling as
much as possible. We compared the greedy tiling to the tilings
obtained by ordering all frequent and maximal frequent itemsets by
their frequencies. The greedy tiling is able to describe the database
quite adequately: the first 34 tiles (shown in Table 4.1) in the tiling
cover 43.85 percent (28570/65152-fraction) of the ones in the
database. As a comparison, the 34 most frequent itemsets and the 34
most frequent closed itemsets (shown in Table 4.2) cover just 19.13
percent (12462/65152-fraction) of the database. Furthermore,
already the first 49 tiles in the greedy tiling cover more than half of
the ones in the database.

The relatively weak performance of frequent itemsets can be
explained by the fact that frequent itemsets do not care about the
other frequent itemsets and also the interaction between closed
itemsets is very modest. Furthermore, all of the 34 most frequent
itemsets are quite small, the three largest of them consisting of only
three items, whereas the 22nd tile¹ contains 23 items.

¹ The tile has the largest itemset within the first 34 tiles in the tiling and
it consists almost solely of courses offered by the faculty of law. The group of
students inducing the tile seems to comprise former computer science students
who wanted to become lawyers and a couple of law students who have studied
a couple of courses at the department of computer science.
Table 4.1: The first 34 itemsets in the greedy tiling of the course
completion database.

supp(X, D)  area(X, D)   X
       411        4521   {0, 2, 3, 5, 6, 7, 12, 13, 14, 15, 20}
      1345        2690   {0, 1}
       765        3060   {2, 3, 4, 5}
       367        1835   {2, 21, 22, 23, 24}
       418        1672   {7, 9, 17, 31}
       599        1198   {8, 11}
       513        1026   {16, 32}
       327        1635   {7, 14, 18, 20, 27}
       706        2118   {0, 3, 10}
       357        1428   {6, 7, 17, 19}
       405        1215   {0, 24, 29}
       362        1810   {7, 9, 13, 15, 30}
       197         985   {2, 19, 33, 34, 45}
       296        1184   {2, 3, 25, 28}
       422         844   {18, 26}
       166         830   {21, 23, 37, 43, 48}
       269         538   {36, 52}
       393         786   {3, 35}
       329        1645   {6, 7, 13, 15, 41}
       221         442   {40, 60}
       735        2205   {2, 5, 12}
        20         460   {1, 11, 162, 166, 175, 177, 189, 191,
                          204, 206, 208, 209, 216, 219, 223, 226,
                          229, 233, 249, 257, 258, 260, 272}
       294         882   {14, 20, 38}
       410         410   {39}
       852        1704   {1, 8}
       313         939   {7, 9, 44}
       193         386   {42, 49}
       577        1154   {21, 23}
       649         649   {25}
      1069        1069   {6}
       939        1878   {0, 4}
       336         336   {46}
       264         528   {19, 50}
       328         328   {47}
Table 4.2: The 34 most frequent (closed) itemsets in the course
completion database.

supp(X, D)  area(X, D)   X
      2405           0   ∅
      2076        2076   {0}
      1547        1547   {1}
      1498        1498   {2}
      1345        2690   {0, 1}
      1293        2586   {0, 2}
      1209        1209   {3}
      1098        2196   {2, 3}
      1081        1081   {4}
      1071        1071   {5}
      1069        1069   {6}
      1060        1060   {7}
      1057        2114   {1, 2}
      1052        2104   {0, 3}
      1004        2008   {3, 5}
       992        1984   {2, 5}
       983        1966   {2, 7}
       971        1942   {2, 6}
       960        2880   {2, 3, 5}
       958        2874   {0, 2, 3}
       943        1886   {0, 5}
       939        1878   {0, 4}
       931         931   {8}
       924        1848   {0, 7}
       921        1842   {0, 6}
       920         920   {9}
       915        2745   {0, 1, 2}
       911        1822   {1, 3}
       896        1792   {6, 7}
       887        2661   {0, 3, 5}
       880        1760   {3, 4}
       875        2625   {0, 2, 5}
       870        1740   {2, 4}
       862        2586   {0, 2, 7}
Table 4.3: The 34 maximal itemsets with minimum support
threshold 700 in the course completion database.

supp(X, D)  area(X, D)   X
       748         748   {15}
       744         744   {16}
       733         733   {17}
       707         707   {18}
       709        1418   {0, 11}
       700        1400   {7, 13}
       721        1442   {7, 14}
       732        2196   {0, 1, 4}
       730        2190   {0, 1, 5}
       712        2136   {0, 1, 7}
       741        2223   {0, 1, 8}
       749        2247   {0, 2, 9}
       706        2118   {0, 3, 10}
       721        2163   {1, 2, 6}
       755        2265   {1, 2, 7}
       750        2250   {2, 3, 6}
       738        2214   {2, 3, 9}
       724        2172   {2, 3, 10}
       705        2115   {2, 5, 6}
       716        2148   {2, 5, 10}
       726        2178   {2, 7, 9}
       705        2115   {3, 5, 6}
       720        2160   {3, 5, 10}
       704        2112   {3, 6, 7}
       737        2948   {0, 1, 2, 3}
       722        2888   {0, 2, 3, 4}
       849        3396   {0, 2, 3, 5}
       741        2964   {0, 2, 3, 7}
       729        2916   {0, 2, 6, 7}
       706        2824   {0, 3, 4, 5}
       749        2996   {1, 2, 3, 5}
       765        3060   {2, 3, 4, 5}
       757        3028   {2, 3, 5, 7}
       727        2908   {2, 3, 5, 12}
A slightly better performance can be obtained with 34 maximal
itemsets (shown in Table 4.3): the 34 maximal itemsets determine
a tiling that covers 26.64 percent (17356/65152-fraction) of the
database. (The 34 maximal itemsets were obtained by choosing the
minimum support threshold to be 700. This is also the origin of
choosing the value 34 as the number of illustrated itemsets.)
Maximal itemsets depend more on each other since the maximal itemsets
form an antichain (see Chapter 5 for more details).

It can be argued that we could afford a slightly larger number
of frequent itemsets since they are in some sense simpler than the
tiles. We tested this with the collections of the closed 0.20-frequent
itemsets (2136 itemsets) and the maximal 0.20-frequent itemsets
(253 itemsets) which have been used in previous real examples.
They cover 43.80 percent (28535/65152-fraction) and 41.12 percent
(26789/65152-fraction), respectively. That is still less than the
first 34 tiles in the greedy tiling. □

Thus, sometimes the pattern ordering can be computed incre- 
mentally although generating the whole pattern collection would 
be infeasible. Still, there are many ways how the tilings could be 
improved. 

First, the definitions of tiles and tilings could be adapted also to 
submatrices full of zeros since a submatrix full of zeros is equivalent 
to a submatrix full of ones in the binary matrix where all bits are 
flipped. 

Second, the complexity of describing a particular tile could
be taken into account. Assuming no additional information about
the matrix, a tile $\tau(C, X)$ in $\mathcal{D}$ can be described using

$$|C| \log |\mathcal{D}| + |X| \log |\mathcal{I}|$$

bits. However, when the encoding costs of the tiles are taken into
account, it is not sufficient to consider only maximal tiles.

Example 4.10 (Maximal tiles with costs are not optimal).
Let the transaction database $\mathcal{D}$ consist of two transactions:
$(1, \{A\})$ and $(2, \{A, B\})$. Then the maximal tiles describing $\mathcal{D}$
are $\{(1, A), (2, A)\}$ and $\{(2, A), (2, B)\}$ whereas tiles $\{(1, A)\}$ and
$\{(2, A), (2, B)\}$ would be sufficient and slightly cheaper, too. □
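The costs in Example 4.10 can be worked out directly from the description length |C| log |D| + |X| log |I|. The sketch below assumes base-2 logarithms (the text only writes "log"), and the function and variable names are chosen for this illustration.

```python
from math import log2

def tile_cost(c, x, n_transactions, n_items):
    """Bits needed to describe the tile C x X: |C| log|D| + |X| log|I|.
    Base-2 logarithms are an assumption; the text only writes 'log'."""
    return len(c) * log2(n_transactions) + len(x) * log2(n_items)

def tiling_cost(tiling, n_transactions, n_items):
    """Total description length of a tiling given as (C, X) pairs."""
    return sum(tile_cost(c, x, n_transactions, n_items) for c, x in tiling)

# The database of Example 4.10 has two transactions over the items A and B.
maximal_tiling = [({1, 2}, {"A"}), ({2}, {"A", "B"})]
cheaper_tiling = [({1}, {"A"}), ({2}, {"A", "B"})]

cost_maximal = tiling_cost(maximal_tiling, 2, 2)
cost_cheaper = tiling_cost(cheaper_tiling, 2, 2)
```

With two transactions and two items every identifier costs one bit, so the maximal tiles cost 3 + 3 = 6 bits while the non-maximal alternative costs 2 + 3 = 5 bits, although both tilings cover all three ones of the database.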

Third, the bound for the number of databases given by
Equation 4.9 does not take into account the fact that the tiles in the
tiling are maximal. Let $tid(\tau)$ and $\mathcal{I}_\tau$ be the transaction identifiers
and the items in a tile $\tau$, respectively. The maximality of the tile $\tau$
restricts the collection of compatible databases $\mathcal{D}$ as follows. The
tile $\tau$ must be in the compatible database $\mathcal{D}$. For each transaction
identifier $i \in tid(\mathcal{D}) \setminus tid(\tau)$ there must be an item $A \in \mathcal{I}_\tau$ such
that $(i, X) \in \mathcal{D}$ does not contain $A$. For each item $A \in \mathcal{I} \setminus \mathcal{I}_\tau$ there
must be a transaction identifier $i \in tid(\tau)$ such that $(i, X) \in \mathcal{D}$
does not contain $A$. The collection of the transaction databases
compatible with a tiling $\mathcal{T}$ is the intersection of the collections of the
transaction databases compatible with each tile in $\mathcal{T}$.

4.5 Condensation by Pattern Ordering 

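We evaluated the ability of the pattern ordering approach to
condense the collections of the σ-frequent itemsets using the estimation
method

$$\psi_{Max}(X, fr|_{\mathcal{F}(\sigma, \mathcal{D})'}) = \max \{ fr(Y, \mathcal{D}) : X \subseteq Y \in \mathcal{F}(\sigma, \mathcal{D})' \}$$

where $\mathcal{F}(\sigma, \mathcal{D})'$ is the subcollection of σ-frequent itemsets for which
the frequencies are known (see also Example 4.3). The loss function
used in the experiments was the average absolute error with uniform
distribution over the itemset collection, i.e.,

$$\ell(fr, \psi_{Max}(\cdot, fr|_{\mathcal{F}(\sigma, \mathcal{D})'})) = \frac{1}{|\mathcal{F}(\sigma, \mathcal{D})|} \sum_{X \in \mathcal{F}(\sigma, \mathcal{D})} \left| fr(X, \mathcal{D}) - \psi_{Max}(X, fr|_{\mathcal{F}(\sigma, \mathcal{D})'}) \right|.$$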

The pattern orderings were found by the algorithm
Order-Patterns (Algorithm 4.1). We computed the pattern orderings
for σ-frequent itemsets in the transaction databases Internet
Usage and IPUMS Census for several different minimum frequency
thresholds σ ∈ [0, 1]. The results are shown in Figure 4.5 and in
Table 4.4 for Internet Usage and in Figure 4.6 and in Table 4.5 for
IPUMS Census.
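The experimental setup can be mimicked on a toy scale. The sketch below assumes that the ordering is built greedily, i.e., that at each step the itemset whose addition gives the smallest average absolute error is appended; the details of Order-Patterns are not repeated in this section, so this greedy loop, the toy database and all names are assumptions for illustration only.

```python
from itertools import combinations

# Hypothetical toy transaction database (not one of the data sets in the text).
DB = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]

def frequent_itemsets(sigma):
    """All itemsets with frequency at least sigma, with their frequencies."""
    items = sorted(set().union(*DB))
    result = {}
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            x = frozenset(combo)
            f = sum(x <= t for t in DB) / len(DB)
            if f >= sigma:
                result[x] = f
    return result

def avg_abs_error(fr, known):
    """Average absolute error of the maximum-based estimates from `known`."""
    def psi_max(x):
        return max((fr[y] for y in known if x <= y), default=0.0)
    return sum(abs(fr[x] - psi_max(x)) for x in fr) / len(fr)

def order_patterns(fr):
    """Greedy ordering: always append the itemset that minimizes the error."""
    order, errors, remaining = [], [], set(fr)
    while remaining:
        # Ties are broken deterministically by the sorted item list.
        best = min(remaining,
                   key=lambda p: (avg_abs_error(fr, order + [p]), sorted(p)))
        order.append(best)
        remaining.remove(best)
        errors.append(avg_abs_error(fr, order))
    return order, errors

fr = frequent_itemsets(0.5)
order, errors = order_patterns(fr)
```

In this toy database all eight itemsets are 0.5-frequent and four of them are closed; the error of the greedy ordering reaches zero exactly when these four closed itemsets have been listed.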

In the figures the axes are the following. The x-axis corresponds
to the length of the prefix of the pattern ordering and the y-axis
corresponds to the average absolute error of the frequency
estimation from the corresponding prefix. The labels of the curves express
Figure 4.5: Pattern orderings for Internet Usage data.

σ     |F(σ,D)|        0    0.001    0.005     0.01     0.02     0.04     0.08
0.20      1856     1856     1574     1190      900      619      418      188
0.19      2228     2228     1870     1396     1052      728      486      212
0.18      2667     2667     2217     1625     1206      820      522      211
0.17      3246     3246     2672     1925     1421      970      597      231
0.16      4013     4013     3254     2295     1671     1132      655      242
0.15      4983     4983     3994     2764     1995     1377      775      270
0.14      6291     6290     4955     3339     2362     1602      860      261
0.13      8000     7998     6208     4093     2881     1972     1034      281
0.12     10476    10472     7970     5118     3562     2414     1189      289
0.11     13813    13802    10267     6352     4305     2804     1284      264
0.10     18615    18594    13468     8068     5409     3395     1423      245
0.09     25729    25686    18035    10399     6920     4094     1587      203
0.08     36812    36714    24870    13681     9032     5008     1708      153
0.07     54793    54550    35441    18477    12147     6276     1803       95
0.06     85492    84873    52295    25595    16376     7568     1747       29

Table 4.4: Pattern orderings for Internet Usage data.
Figure 4.6: Pattern orderings for IPUMS Census data.

σ     |F(σ,D)|        0    0.001    0.005     0.01     0.02     0.04     0.08
0.30      8205     1335      444      285      212      153      107       61
0.29      9641     1505      496      317      236      167      116       65
0.28     11443     1696      551      351      260      184      120       66
0.27     13843     1948      624      395      292      203      128       68
0.26     17503     2293      725      456      338      233      147       71
0.25     20023     2577      810      502      369      256      161       77
0.24     23903     3006      944      583      427      293      185       92
0.23     31791     3590     1093      661      477      328      196       85
0.22     53203     4271     1194      678      481      316      171       57
0.21     64731     5246     1454      813      573      372      189       62
0.20     86879     6689     1771      949      661      424      218       67
0.19    151909     8524     1974      953      628      363      151       27
0.18    250441    10899     2212      992      625      312       99       10

Table 4.5: Pattern orderings for IPUMS Census data.
the minimum frequency thresholds of the corresponding frequent
itemset collections.

The tables can be interpreted as follows. The columns σ and
|F(σ, D)| correspond to the minimum frequency threshold σ and the
number of σ-frequent itemsets. The rest of the columns 0, 0.001,
0.005, 0.01, 0.02, 0.04 and 0.08 correspond to the number of itemsets
in the shortest prefix with the loss at most 0, 0.001, 0.005, 0.01, 0.02,
0.04 and 0.08, respectively. (Note that the column 0 corresponds
to the number of closed frequent itemsets by Theorem 4.6.)

The results show that already relatively short prefixes of the
pattern orderings provide frequency estimates with high accuracy.
The inversions of the orders of the error curves in Figure 4.5 and
in Figure 4.6 are due to the combination of the estimation
method and the loss function used. On the one hand, the average
absolute error is lower for frequent itemset collections with lower
minimum frequency threshold when the frequencies are estimated
without any data, since initially all frequency estimates of the frequent
itemsets are zero. On the other hand, the frequencies can be estimated
correctly from the closed frequent itemsets, and the number of closed
frequent itemsets is smaller for higher minimum frequency thresholds.
CHAPTER 5

Exploiting Partial Orders of
Pattern Collections



Large collections of interesting patterns can be very difficult to
understand and it can be too expensive even to manipulate all
patterns. Because of these difficulties, a large portion of recent
pattern discovery research has focused on inventing condensed
representations for pattern collections. (See Section 2.4 for more
details.)

Most of the condensed representations are based on relatively
local properties of the pattern collections: the patterns in the
condensed representations are typically chosen solely based on small
neighborhoods in the original pattern collection regardless of which
of the patterns are deemed to be redundant and which are chosen
for the condensed representation.

Two notable exceptions to this are the condensed frequent
pattern bases [PDZH02] and non-derivable itemsets [CG02]. Still, even
these condensed representations have certain drawbacks and
limitations.

The construction of condensed frequent pattern bases is based
on a greedy strategy: the patterns are pruned from minimal to
maximal or vice versa. A pattern is deemed to be redundant (i.e.,
not being in the pattern base) if its frequency is close enough to
the frequency of some already found irredundant pattern that is its
super- or subpattern, depending on the processing direction of the
pattern collection. Alas, also the condensed frequent pattern bases
can be interpreted to be dependent only on the local neighborhoods
of the patterns, although the neighborhoods are determined by the
frequencies rather than only by the structure of the underlying
pattern class.

The non-derivable itemsets (Definition 2.12) take into account
more global properties of the pattern collection. Namely, the
irredundancy (i.e., non-derivability) of an itemset depends on the
frequencies of all its subitemsets. However, the irredundancy of the
itemset in the case of non-derivable itemsets is determined using
inclusion-exclusion truncations. Although the non-derivable itemsets
can be superficially understood as the itemsets whose frequencies
cannot be derived exactly from the frequencies of their subitemsets,
it is not so easy to see immediately which aspects of the itemset
collection and the frequencies of the itemsets one particular
non-derivable itemset represents, i.e., to see the essence of the upper
and the lower bounds of the itemsets for the underlying transaction
database.

The pattern collections have also other structure than the quality
values. (In fact, not all pattern collections have quality values at
all. For example, the interesting patterns could be determined by
an oracle that is not willing to say anything else than whether or
not a pattern is fascinating.) In particular, virtually all pattern
collections adhere to some non-trivial partial order (Definition 2.5).

The goal of this chapter is to make pattern collections more
understandable and concise by exploiting the partial orders of the
collections. We use the partial orders to partition a given
pattern collection into subcollections of (in)comparable patterns, i.e.,
into (anti)chains. In addition to clustering the patterns in the collection
into smaller groups using the partial order of the pattern class, we
show that the chaining of patterns can also condense the pattern
collection: for many pattern classes each chain, representing
possibly several patterns, can be represented as a pattern that is only
slightly more complex than each of the patterns in the chain.

In this chapter, we propose the idea of (anti)chaining patterns,
illustrate its usefulness and potential pitfalls, and discuss the
computational aspects of chaining massive pattern collections.
Furthermore, we explain how, for some pattern classes, each chain can
be represented as one pattern that is slightly more complex than the
patterns in the underlying pattern collection.

This chapter is based on the article "Chaining Patterns" [Mie05a].
5.1 Exploiting the Structure

The collections of interesting patterns (and the underlying pattern
classes, too) usually have some structure.

Example 5.1 (structuring the collection of itemsets by
frequencies). The collection $2^{\mathcal{I}}$ of all itemsets can be structured
based on their frequencies: every subset of a frequent itemset is
frequent and every superset of an infrequent itemset is infrequent.
Thus, for each minimum frequency threshold σ ∈ [0, 1], a given
transaction database $\mathcal{D}$ determines a partition

$$\langle \mathcal{F}(\sigma, \mathcal{D}), 2^{\mathcal{I}} \setminus \mathcal{F}(\sigma, \mathcal{D}) \rangle$$

of the itemset collection $2^{\mathcal{I}}$. □

The downward closed collections of frequent itemsets are
examples of data-dependent structures of pattern collections. The
pattern collections have also some data-independent structure. Maybe
the most typical data-independent structure in a pattern collection
is a partial order over the patterns.

Example 5.2 (set inclusion as a partial order over
itemsets). Let the pattern class be again $2^{\mathcal{I}}$. A natural partial order
for itemsets is the partial order determined by the set inclusion
relation:

$$X \preceq Y \iff X \subseteq Y$$

for all $X, Y \subseteq \mathcal{I}$. □

A partial order where no two patterns are comparable, i.e., an
empty partial order, is called a trivial partial order. For example,
any partial order restricted to maximal or minimal patterns is
trivial. A trivial partial order is the least informative partial order in
the sense that it does not relate the patterns to each other at all.

Besides merely detecting the structure in the pattern collection,
the found structure can sometimes be further exploited. For
example, the frequent itemsets can be stored into an itemset tree
by defining a total order over $\mathcal{I}$. In an itemset tree, each itemset
corresponds to a path from the root to some node, the labels of the
edges being the items of the itemset in ascending order. (Itemset
trees are known also by several other names, see EPYMOllEiEnni.)
Example 5.3 (an itemset tree). Let the itemset collection
consist of itemsets ∅, {A}, {A, B, C}, {A, B, D}, {A, C}, {B}, {B, C, D},
and {B, D}. The itemset tree representing this itemset collection
is shown as Figure 5.1.

Figure 5.1: An itemset tree representing the itemset collection of
Example 5.3. Each itemset can be seen in the tree as a path from
the root to a solid node.

□

Representing an itemset collection as an itemset tree can save
space and support efficient quality value queries. The quality value
of an itemset X can be retrieved (or it can be decided that X is not
in the itemset tree) in time O(|X|). (Time and space complexities
similar to itemset trees can be obtained also by refining the itemset
trees to automata [Mie05a].) Unfortunately, the structure of itemset
trees is strongly dependent on the ordering of the items: there are
not always natural orderings for the items, and an arbitrary ordering
can induce artificial structure in the itemset tree that hides the
essence of the pattern collection.
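An itemset tree can be sketched as a trie whose edges are labeled by items in ascending order. The class below is a minimal illustration; its name, its methods and the quality values used in the usage lines are hypothetical. A lookup follows one edge per item, so it takes time proportional to |X| once the items of X are given in the chosen order (the sketch sorts them first for convenience).

```python
class ItemsetTree:
    """A trie over itemsets; the items on each path are in ascending order."""

    def __init__(self):
        self.children = {}   # item -> subtree
        self.solid = False   # True if the path to this node is a stored itemset
        self.value = None    # quality value of the stored itemset

    def insert(self, itemset, value):
        node = self
        for item in sorted(itemset):
            node = node.children.setdefault(item, ItemsetTree())
        node.solid = True
        node.value = value

    def lookup(self, itemset):
        """Return the quality value of the itemset, or None if it is not stored."""
        node = self
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return None
        return node.value if node.solid else None

# The itemset collection of Example 5.3 with made-up quality values.
tree = ItemsetTree()
for itemset, value in [("", 1.0), ("A", 0.6), ("ABC", 0.2), ("ABD", 0.2),
                       ("AC", 0.3), ("B", 0.7), ("BCD", 0.1), ("BD", 0.4)]:
    tree.insert(frozenset(itemset), value)
```

The node reached by the path A, B exists in the tree but is not solid, since {A, B} itself is not in the collection; the lookup therefore reports it as missing.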

Exploiting the partial order structure of a pattern collection
might still be beneficial although, for example, it
is not clear whether the itemset tree makes the partial order of an
itemset collection more understandable or even more obscure from
the human point of view. A simple approach to reduce the
obscurity of the itemset trees is to construct for each itemset collection
a random forest of itemset trees where each itemset tree represents
the itemset collection using some random ordering of the items.
(These random forests should not be confused with the random
forests of Leo Breiman [Bre01].) Unfortunately, the ordering of the
items is still present in each of the itemset trees. Fortunately, there
are structures in partial orders that do not depend on anything else
than the partial order. Two important examples of such structures
are chains and antichains.

Definition 5.1 (chains and antichains). A subset $\mathcal{C}$ of a partially
ordered set $\mathcal{P}$ is called a chain if and only if all elements in $\mathcal{C}$
are comparable with each other, i.e., $p \preceq p'$ or $p' \preceq p$ holds for all
$p, p' \in \mathcal{C}$.

The rank of a pattern $p$ in a chain $\mathcal{C}$, denoted by $rank(p, \mathcal{C})$, is the
number of elements that have to be removed from the chain before
$p$ is the minimal pattern in the chain.

A subset $\mathcal{A}$ of a partially ordered set $\mathcal{P}$ is called an antichain if
and only if all elements in $\mathcal{A}$ are incomparable with each other, i.e.,
$p \prec p'$ holds for no $p, p' \in \mathcal{A}$.

Example 5.4 (chains and antichains). The itemset collection

$\{\{A, B, C, D, E, F\}, \{A, C, E\}, \{C, E\}, \{C\}\}$

is a chain with respect to the set inclusion relation since all itemsets
in the collection are comparable with each other. Similarly, the
itemset collection $\{\{A, B\}, \{C\}\}$ is an antichain. The itemset
collection $\{\{A, B\}, \{A\}, \{B\}\}$ is neither a chain nor an antichain: the
collection is not a chain since $\{A\}$ and $\{B\}$ are not comparable,
and it is not an antichain since $\{A, B\}$ is comparable with $\{A\}$ and
$\{B\}$. □
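
Definition 5.1 can be checked mechanically. The sketch below tests small itemset collections of the kind used in Example 5.4 under set inclusion; the function names are illustrative assumptions, and the collections are assumed to contain distinct itemsets.

```python
from itertools import combinations


def comparable(x, y):
    """Two itemsets are comparable under set inclusion if one contains the other."""
    return x <= y or y <= x


def is_chain(collection):
    """Every pair of distinct itemsets is comparable."""
    return all(comparable(x, y) for x, y in combinations(collection, 2))


def is_antichain(collection):
    """No pair of distinct itemsets is comparable."""
    return not any(comparable(x, y) for x, y in combinations(collection, 2))


chain = [frozenset("ABCDEF"), frozenset("ACE"), frozenset("CE"), frozenset("C")]
antichain = [frozenset("AB"), frozenset("C")]
neither = [frozenset("AB"), frozenset("A"), frozenset("B")]

for name, coll in [("chain", chain), ("antichain", antichain), ("neither", neither)]:
    print(name, is_chain(coll), is_antichain(coll))
```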

A chain or an antichain in a pattern collection can be understood
more easily than the whole pattern collection, since every pair of
patterns in a chain is related and no pair of patterns in an
antichain is. Thus, a natural approach to make a pattern collection
more digestible using a partial order structure is to partition the
pattern collection into chains or antichains.

Definition 5.2 (chain and antichain partitions). A chain partition
(an antichain partition) of a partially ordered set $\mathcal{P}$ is a partition
of the set $\mathcal{P}$ into disjoint chains $\mathcal{C}_1, \ldots, \mathcal{C}_m$ (disjoint antichains
$\mathcal{A}_1, \ldots, \mathcal{A}_m$).

A chain partition (an antichain partition) of $\mathcal{P}$ is minimum if
and only if there is no chain partition (no antichain partition) of $\mathcal{P}$
consisting of a smaller number of chains (antichains).

A chain partition (an antichain partition) of $\mathcal{P}$ is minimal if and
only if there are no two chains $\mathcal{C}_i$ and $\mathcal{C}_j$ (antichains $\mathcal{A}_i$ and $\mathcal{A}_j$) in
the chain partition (the antichain partition) such that their union
is a chain (an antichain).

A chain or an antichain partition can be interpreted as a struc- 
tural clustering of the patterns. Each chain represents a collection 
of comparable patterns and each antichain a collection of incom- 
parable ones, i.e., a chain consists of structurally similar patterns 
whereas an antichain can be seen as a representative collection of 
patterns. 

The minimum chain partition is not necessarily unique. Because
of the exploratory nature of data mining, the lack of uniqueness is
not merely a problem: different partitions highlight different aspects
of the pattern collection, which can clearly be beneficial when one
is trying to understand the pattern collection (and the underlying
data set).

The maximum number of chains (antichains) in a chain partition
(an antichain partition) of a pattern collection $\mathcal{P}$ is $|\mathcal{P}|$ since each
pattern $p \in \mathcal{P}$ as a singleton set $\{p\}$ is a chain and an antichain
simultaneously. The minimum number of chains in a chain partition
is at least the cardinality of the largest antichain in $\mathcal{P}$ since no two
distinct patterns in the largest antichain can be in the same chain.
This inequality can be shown to be actually an equality and the
result is known as Dilworth's Theorem:

Theorem 5.1 (Dilworth's Theorem [Juk01]). A partially ordered
set $\mathcal{P}$ can be partitioned into $m$ chains if and only if the largest
antichain in $\mathcal{P}$ is of cardinality at most $m$.

Example 5.5 (bounding the number of chains from below).
The maximal patterns in a pattern collection form an antichain.
Thus, the number of maximal patterns is a lower bound for the
number of chains in the minimum chain partition. □

Similarly to bounding the minimum chain partitions by maximum
antichains, it is possible to bound the minimum number of
antichains needed to cover all patterns in $\mathcal{P}$ by the cardinality of
the maximum chain in $\mathcal{P}$:

Theorem 5.2 ([Sch03]). The number of antichains in a minimum
antichain partition of a partially ordered set $\mathcal{P}$ is equal to the
cardinality of a maximum chain in $\mathcal{P}$.

5.2 Extracting Chains and Antichains 

The problem of finding the minimum chain partition for a partially
ordered pattern collection $\mathcal{P}$ can be formulated as follows:

Problem 5.1 (minimum chain partition). Given a pattern collection
$\mathcal{P}$ and a partial order $\prec$ over $\mathcal{P}$, find a partition of $\mathcal{P}$ into
the minimum number of chains $\mathcal{C}_1, \ldots, \mathcal{C}_m$.

The minimum chain partition can be found efficiently by finding
a maximum matching in a bipartite graph [LP86]. The maximum
bipartite matching problem is the following [Sch03]:

Problem 5.2 (maximum bipartite matching). Given a bipartite
graph $(U, V, E)$ where $U$ and $V$ are sets of vertices and $E$ is a
set of edges between $U$ and $V$, i.e., a set of pairs in $U \times V$, find a
maximum bipartite matching $M \subseteq E$, i.e., find a largest subset $M$
of $E$ such that

$$\deg(u, M) = |\{e \in M : (u, v) = e \text{ for some } v \in V\}| \le 1$$

for all $u \in U$ and

$$\deg(v, M) = |\{e \in M : (u, v) = e \text{ for some } u \in U\}| \le 1$$

for all $v \in V$.

The matching is computed in a bipartite graph consisting of two
copies $\mathcal{P}$ and $\mathcal{P}'$ of the pattern collection $\mathcal{P}$, with the edges
$\prec'$ between $\mathcal{P}$ and $\mathcal{P}'$ corresponding to the partial order $\prec$.
Thus, the bipartite graph representation of the pattern collection
is a triplet $(\mathcal{P}, \mathcal{P}', \prec')$.

Proposition 5.1. A matching $M$ in a bipartite graph $(\mathcal{P}, \mathcal{P}', \prec')$
determines a chain partition. The number of unmatched vertices in
$\mathcal{P}$ (or equivalently in $\mathcal{P}'$) is equal to the number of chains.

Proof. Let us consider the partially ordered set as a graph $(\mathcal{P}, \prec)$.
A matching $M \subseteq \prec'$ ensures that the in-degree and out-degree of
each $p \in \mathcal{P}$ is at most one. Thus, the set $M$ partitions the graph
$(\mathcal{P}, \prec)$ into paths. By transitivity of partial orders, each path is a
chain.

The unmatched patterns in $\mathcal{P}$ correspond to the minimal
patterns of the chains. Each unmatched pattern $p \in \mathcal{P}$ is a
minimal pattern in some chain, and if a pattern $p$ is a minimal pattern
in some chain then it is unmatched. As each chain contains exactly
one minimal pattern, the number of unmatched patterns in $\mathcal{P}$ is
equal to the number of chains in the chain partition corresponding
to the matching $M$. □

Due to Proposition 5.1, the number of chains is minimized when
the cardinality of the matching is maximized. The chain partition
can be extracted from the matching $M$ in time linear in the cardinality
of $\mathcal{P}$. The partition of a partially ordered pattern collection
into the minimum number of chains can be computed as described
by Algorithm 5.1.

A maximum matching $M$ in a bipartite graph $(U, V, E)$ can be
found in time $O(\sqrt{\min\{|U|, |V|\}}\,|E|)$ [Gal86]. Thus, if the partial
order $\prec$ is known explicitly, then the minimum chain partition
can be found in time $O(\sqrt{|\mathcal{P}|}\,|\prec|)$, which can be bounded above by
$O(|\mathcal{P}|^{5/2})$ since the cardinality of $\prec$ is at most $|\mathcal{P}|(|\mathcal{P}| - 1)/2$.
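
The reduction from chain partitions to bipartite matchings can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses simple augmenting paths (Kuhn's algorithm) rather than the faster matching algorithm cited above, the function name minimum_chain_partition is hypothetical, and precedes is assumed to be a strict (transitive, irreflexive) partial order.

```python
def minimum_chain_partition(patterns, precedes):
    """Partition patterns into the minimum number of chains.

    A maximum matching between the patterns and their copies is found with
    augmenting paths; matched edges are then followed to read off the chains.
    """
    n = len(patterns)
    succ = [[j for j in range(n) if precedes(patterns[i], patterns[j])]
            for i in range(n)]
    nxt = [None] * n    # nxt[i] = j  if the pair (i, j) is in the matching
    prev = [None] * n   # prev[j] = i if the pair (i, j) is in the matching

    def augment(i, seen):
        """Try to match i, re-matching earlier vertices if necessary."""
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if prev[j] is None or augment(prev[j], seen):
                nxt[i], prev[j] = j, i
                return True
        return False

    for i in range(n):
        augment(i, set())

    chains = []
    for start in range(n):
        if prev[start] is None:            # no predecessor: a chain starts here
            chain, k = [patterns[start]], start
            while nxt[k] is not None:
                k = nxt[k]
                chain.append(patterns[k])
            chains.append(chain)
    return chains


itemsets = [frozenset(s) for s in ({1}, {2}, {1, 3}, {2, 4}, {1, 2, 3}, {1, 2, 4})]
chains = minimum_chain_partition(itemsets, lambda x, y: x < y)
for chain in chains:
    print([sorted(x) for x in chain])
```

For these six itemsets under proper set inclusion, the two chains found are the only partition into two chains, since each chain must contain one itemset of each cardinality.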

The idea of partitioning the graph $(\mathcal{P}, \prec)$ into the minimum
number of disjoint paths can be generalized to partitioning it into
disjoint degree-constrained subgraphs with the maximum number of
matched edges by finding a maximum bipartite $b$-matching instead
of a maximum (ordinary) bipartite matching. The
maximum bipartite $b$-matching differs from the maximum bipartite
matching (Problem 5.2) only by the degree constraints. Namely,
each vertex $v \in U \cup V$ has a positive integer $b(v)$ constraining
the maximum degree of the vertex: the degree $\deg(v, M)$ of $v$ in
the matching $M$ can be at most $b(v)$. Thus the maximum bipartite
matching is a special case of the maximum bipartite $b$-matching
with $b(v) = 1$ for all $v \in U \cup V$.

Algorithm 5.1 A minimum chain partition.

Input: A pattern collection $\mathcal{P}$ and a partial order $\prec$ over $\mathcal{P}$.
Output: Partition of $\mathcal{P}$ into the minimum number $m$ of chains
$\mathcal{C}_1, \ldots, \mathcal{C}_m$.

function Partition-into-Chains($\mathcal{P}, \prec$)
    $M \leftarrow$ Maximum-Matching($\mathcal{P}, \mathcal{P}', \prec'$)
    $m \leftarrow 0$
    for all $p \in \mathcal{P}$ do
        $prev[p] \leftarrow p$
        $next[p] \leftarrow p$
    end for
    for all $(p, p') \in M$ do
        $next[p] \leftarrow p'$
        $prev[p'] \leftarrow p$
    end for
    for all $p \in \mathcal{P}$, $p = prev[p]$ do
        $m \leftarrow m + 1$
        $\mathcal{C}_m \leftarrow \{p\}$
        while $p \neq next[p]$ do
            $p \leftarrow next[p]$
            $\mathcal{C}_m \leftarrow \mathcal{C}_m \cup \{p\}$
        end while
    end for
    return $(\mathcal{C}_1, \ldots, \mathcal{C}_m)$
end function

If there is a weight function $w : \prec \to \mathbb{R}$, then the graph $(\mathcal{P}, \prec)$ can
be partitioned also into disjoint paths with maximum total weight.
That is, the pattern collection can be partitioned into disjoint chains
in such a way that the sum of the weights of consecutive patterns in
the chains is maximized. This can be done by finding a maximum weight
bipartite matching that differs from the maximum bipartite matching
(Problem 5.2) by the objective function: instead of maximizing the
cardinality $|M|$ of the matching $M$, the weight $\sum_{e \in M} w(e)$ of the
edges in the matching $M$ is maximized.

However, there are two complications in partitioning partially
ordered pattern collections into chains: pattern collections are often
enormously large and the partial order over the collection might be
known only implicitly.

Because pattern collections can be massive, finding
the maximum bipartite matching in time $O(\sqrt{|\mathcal{P}|}\,|\prec|)$ can be
too slow. This problem can be overcome by finding a maximal
matching instead of a maximum matching. A maximal matching
in $(\mathcal{P}, \mathcal{P}', \prec')$ can be found in time $O(|\mathcal{P}| + |\prec|)$ as shown by
Algorithm 5.2.

Algorithm 5.2 A greedy algorithm for finding a maximal matching
in a bipartite graph $(\mathcal{P}, \mathcal{P}', \prec')$.

Input: A bipartite graph $(\mathcal{P}, \mathcal{P}', \prec')$.
Output: A maximal matching $M$ in the graph.

function Maximal-Matching($\mathcal{P}, \mathcal{P}', \prec'$)
    for all $p \in \mathcal{P}$ do
        $prev[p] \leftarrow p$
        $next[p] \leftarrow p$
    end for
    for all $(p, p') \in \prec'$ do
        if $p = next[p]$ and $p' = prev[p']$ then
            $next[p] \leftarrow p'$
            $prev[p'] \leftarrow p$
        end if
    end for
    $M \leftarrow \emptyset$
    for all $p \in \mathcal{P}$, $p \neq next[p]$ do
        $M \leftarrow M \cup \{(p, next[p])\}$
        $next[p] \leftarrow p$
    end for
    return $M$
end function



It is easy to see that the cardinality of a maximal matching is at 
least half of the cardinality of the maximum matching in the same 
graph. Unfortunately, this does not imply any non-trivial approxi- 
mation quality guarantees for the corresponding chain partitions. 

Example 5.6 (minimum and minimal chain partitions by
maximum and maximal matchings). Let us consider the pattern
collection $\{1, 2, \ldots, 2n\}$ with partial order

$\prec = \{(i, j) : i < j\}$.

The maximum matching

$\{(1, 2), (2, 3), \ldots, (2n - 1, 2n)\}$

determines only one chain

$\mathcal{C} = \{1, 2, \ldots, 2n\}$

whereas the worst maximal matching

$\{(1, 2n), (2, 2n - 1), \ldots, (n, n + 1)\}$

determines $n$ chains

$\mathcal{C}_1 = \{1, 2n\}, \mathcal{C}_2 = \{2, 2n - 1\}, \ldots, \mathcal{C}_n = \{n, n + 1\}$.

Thus, in the worst case the chain partition found by maximal
matching is $|\mathcal{P}|/2$ times worse than the optimal chain partition
found by maximum matching. □

The quality of the maximal matching, i.e., the quality of the
minimal chain partition, can be improved by finding a total order
conforming to the partial order. If the partial order is known explicitly,
then a total order conforming to it can be found by topological
sorting in time $O(|\prec| + |\mathcal{P}|)$. Sometimes there is a total order that
can be computed without even knowing the partial order explicitly.
For example, frequent itemsets can be sorted with respect to
their cardinalities. This kind of ordering can considerably reduce the
number of chains in the chain partition found by maximal matchings.
The amount of the improvement depends on how well
the total order is able to capture the essence of the partial order
(whatever it might be).

Example 5.7 (improving maximal matching using a total
order). Let us consider the pattern collection and the partial order
of Example 5.6. If the patterns in the collection $\{1, 2, \ldots, 2n\}$
are ordered in ascending or in descending order, then the maximal
matching agrees with the maximum matching, i.e., the minimal
chain partition agrees with the minimum chain partition. □
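
The dependence of the greedy maximal matching on the scan order of the pairs can be illustrated with a short Python sketch. It follows the idea of the greedy matching algorithm of this section on the collection $\{1, \ldots, 2n\}$ with $n = 3$; the function name and the two scan orders are illustrative assumptions.

```python
def greedy_chain_partition(patterns, pairs):
    """Chain partition from a greedy maximal matching.

    pairs lists the related pairs (p, q) with p preceding q; the order in
    which the pairs are scanned decides which maximal matching is found.
    """
    nxt, prev = {}, {}
    for p, q in pairs:
        if p not in nxt and q not in prev:    # both endpoints still unmatched
            nxt[p], prev[q] = q, p
    chains = []
    for p in patterns:
        if p not in prev:                     # p has no predecessor: start a chain
            chain = [p]
            while chain[-1] in nxt:
                chain.append(nxt[chain[-1]])
            chains.append(chain)
    return chains


n = 3
patterns = list(range(1, 2 * n + 1))
order = [(i, j) for i in patterns for j in patterns if i < j]

# Pairs scanned in ascending order of the patterns.
sorted_pairs = sorted(order)
# An adversarial scan order that meets (1, 2n), (2, 2n - 1), ... first.
worst_pairs = sorted(order, key=lambda e: -(e[1] - e[0]))

print(greedy_chain_partition(patterns, sorted_pairs))
print(greedy_chain_partition(patterns, worst_pairs))
```

With the ascending scan order the greedy matching yields a single chain, whereas the adversarial order yields $n$ chains of two patterns each.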

If the partial order is given implicitly, as a function that can
be evaluated for any pair of patterns $p, p' \in \mathcal{P}$, then the explicit
construction of the partial order relation $\prec$ might itself be a major
bottleneck of the chaining of the patterns. The brute force construction
of the partial order $\prec$, i.e., testing all pairs of patterns
in $\mathcal{P}$, requires $O(|\mathcal{P}|^2)$ comparisons. In the worst case this upper
bound is tight.

Example 5.8 (the number of comparisons in the worst case).
Let the pattern collection $\mathcal{P}$ be an antichain with respect to a partial
order $\prec$. Then all patterns in $\mathcal{P}$ must be compared with all other
patterns in $\mathcal{P}$ in order to construct $\prec$ explicitly, i.e., to ensure that
all patterns in $\mathcal{P}$ are incomparable with each other and thus that
$\mathcal{P}$ indeed is an antichain. □

Fortunately, partial order relations have two useful properties
that can be exploited in the construction of $\prec$, namely transitivity
and antisymmetry, which hold for any partial order relation $\preceq$. Due to
transitivity, $p \preceq p'$ and $p' \preceq p''$ together imply $p \preceq p''$, and
antisymmetry guarantees that the graph $(\mathcal{P}, \prec)$ is acyclic. The partial
order $\prec$ can be computed also as a side product of the construction
of a chain partition, as shown for minimal chain partitions by
Algorithm 5.3.

Although Algorithm 5.3 needs time $O(|\mathcal{P}|^2)$ in the worst case,
$|\mathcal{C}_i|$ comparisons are always sufficient to decide whether a pattern
$p \in \mathcal{P}$ can be added to $\mathcal{C}_i$. Furthermore, the number of comparisons
can be reduced to $1 + \lfloor \log_2 |\mathcal{C}_i| \rfloor$ comparisons if $\mathcal{C}_i$ is represented as,
e.g., a search tree instead of a linked list. The number of comparisons
can be reduced also by reusing already evaluated comparisons
and transitivity. Furthermore, there are several other strategies to
construct the partial order relation $\prec$. The usefulness of different
strategies depends on the cost of evaluating the comparisons and
the actual partial order. Thus, it seems that the best
strategy for constructing the partial order has to be determined
experimentally in general.
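
The logarithmic number of comparisons can be illustrated with binary search over a chain stored as a sorted list. The sketch below is an assumption-laden illustration rather than the data structure of the text: the function name insertion_point is hypothetical, precedes is assumed to be a strict partial order, and each call of precedes counts as one comparison.

```python
def insertion_point(chain, p, precedes):
    """Return the index at which p extends the chain, or None if it cannot.

    chain is a list sorted so that chain[i] strictly precedes chain[i + 1].
    By transitivity the elements preceding p form a prefix of the chain, so
    binary search finds the boundary with O(log |chain|) comparisons.
    """
    lo, hi = 0, len(chain)
    while lo < hi:
        mid = (lo + hi) // 2
        if precedes(chain[mid], p):
            lo = mid + 1
        else:
            hi = mid
    # chain[:lo] precede p; p fits only if it precedes chain[lo] (if any).
    if lo == len(chain) or precedes(p, chain[lo]):
        return lo
    return None


def subset(x, y):
    """Strict set inclusion."""
    return x < y


chain = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3, 4})]
print(insertion_point(chain, frozenset({1, 2, 3}), subset))  # fits between two itemsets
print(insertion_point(chain, frozenset({2}), subset))        # incomparable with {1}
```

One final comparison against the element at the boundary decides whether the pattern is comparable with the rest of the chain.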

Another partition of a pattern collection based on a partial order
is an antichain partition. The problem of finding a minimum
antichain partition of a partially ordered pattern collection $\mathcal{P}$ can
be formulated as follows:

Problem 5.3 (minimum antichain partition). Given a pattern
collection $\mathcal{P}$ and a partial order $\prec$ over $\mathcal{P}$, find a partition of $\mathcal{P}$ into
the minimum number of antichains $\mathcal{A}_1, \ldots, \mathcal{A}_m$.

Solving Problem 5.3 is relatively easy based on Theorem 5.2.
Algorithm 5.4 solves the problem in the case of arbitrary pattern
collections.

Algorithm 5.3 An algorithm to find a minimal chain partition.

Input: A pattern collection $\mathcal{P}$ and a partial order $\prec$ over $\mathcal{P}$.
Output: A minimal chain partition $\mathcal{C}_1, \ldots, \mathcal{C}_m$ of the pattern
collection $\mathcal{P}$.

function Minimal-Partition-into-Chains($\mathcal{P}, \prec$)
    $m \leftarrow 0$
    for all $p \in \mathcal{P}$ do
        $prev[p] \leftarrow p$
        $next[p] \leftarrow p$
        $i \leftarrow 1$
        while $i \le m$ and $p = prev[p] = next[p]$ do
            if $p \prec \min \mathcal{C}_i$ then
                $next[p] \leftarrow \min \mathcal{C}_i$
                $prev[\min \mathcal{C}_i] \leftarrow p$
            else if $\max \mathcal{C}_i \prec p$ then
                $prev[p] \leftarrow \max \mathcal{C}_i$
                $next[\max \mathcal{C}_i] \leftarrow p$
            end if
            $p' \leftarrow prev[\min \{p'' \in \mathcal{C}_i : p \prec p''\}]$
            if $p = prev[p] = next[p]$ and $p' \prec p$ then
                $next[p] \leftarrow next[p']$
                $prev[p] \leftarrow p'$
                $next[p'] \leftarrow p$
                $prev[next[p]] \leftarrow p$
            end if
            if $p \neq prev[p]$ or $p \neq next[p]$ then
                $\mathcal{C}_i \leftarrow \mathcal{C}_i \cup \{p\}$
            end if
            $i \leftarrow i + 1$
        end while
        if $p = prev[p] = next[p]$ then
            $m \leftarrow m + 1$
            $\mathcal{C}_m \leftarrow \{p\}$
        end if
    end for
    return $(\mathcal{C}_1, \ldots, \mathcal{C}_m)$
end function

Algorithm 5.4 A minimum antichain partition.

Input: A pattern collection $\mathcal{P}$ and a partial order $\prec$ over $\mathcal{P}$.
Output: Partition of $\mathcal{P}$ into the minimum number $m$ of antichains
$\mathcal{A}_1, \ldots, \mathcal{A}_m$.

function Partition-into-Antichains($\mathcal{P}, \prec$)
    $\mathcal{P}' \leftarrow \mathcal{P}$
    $m \leftarrow 0$
    while $\mathcal{P}' \neq \emptyset$ do
        $m \leftarrow m + 1$
        $\mathcal{A}_m \leftarrow \{p \in \mathcal{P}' : p \preceq p' \in \mathcal{P}' \Rightarrow p = p'\}$
        $\mathcal{P}' \leftarrow \mathcal{P}' \setminus \mathcal{A}_m$
    end while
    return $(\mathcal{A}_1, \ldots, \mathcal{A}_m)$
end function

In many cases the minimum antichain partition can be found
even more easily. For example, the minimum antichain partition
of $\sigma$-frequent itemsets can be computed in time linear in the sum
of the cardinalities of the $\sigma$-frequent itemsets: the length $m$ of the
longest chain in $\mathcal{F}(\sigma, \mathcal{D})$ is one greater than the cardinality of the
largest itemset in the collection. Thus, the collection $\mathcal{F}(\sigma, \mathcal{D})$ can be
partitioned into $m$ antichains $\mathcal{A}_1, \ldots, \mathcal{A}_m$ containing all $\sigma$-frequent
itemsets of cardinalities $0, \ldots, m - 1$, respectively. Clearly, this
partition can be constructed in time linear in $\sum_{X \in \mathcal{F}(\sigma, \mathcal{D})} |X|$ by
maintaining $m$ lists of patterns.
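
Both ways of obtaining an antichain partition can be sketched in a few lines of Python: repeatedly peeling off the maximal patterns, and grouping itemsets by cardinality. The function names are illustrative assumptions, and the toy collection is a downward closed itemset collection, as frequent itemset collections are.

```python
from collections import defaultdict


def antichain_partition(patterns, precedes):
    """Repeatedly remove the maximal patterns; each removed layer is an antichain."""
    remaining = list(patterns)
    antichains = []
    while remaining:
        top = [p for p in remaining
               if not any(precedes(p, q) for q in remaining)]
        antichains.append(top)
        remaining = [p for p in remaining if p not in top]
    return antichains


def partition_by_cardinality(itemsets):
    """Group itemsets by size; itemsets of equal size are pairwise incomparable."""
    levels = defaultdict(list)
    for x in itemsets:
        levels[len(x)].append(x)
    return [levels[k] for k in sorted(levels)]


items = [frozenset(s) for s in ["", "A", "B", "C", "AB", "AC", "BC", "ABC"]]
peeled = antichain_partition(items, lambda x, y: x < y)
levels = partition_by_cardinality(items)
print(len(peeled), len(levels))
```

For this collection both partitions have four antichains, one more than the cardinality of the largest itemset.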



5.3 Condensation by Chaining Patterns 

A chain partition of a pattern collection can be more than a mere 
structural clustering if the collection has more structure than a par- 
tial order. One example of such a pattern collection is a transaction 
database without its transaction identifiers. 

Example 5.9 (itemset chains). Let us consider the transaction
database $\mathcal{D}$ shown as Table 5.1.

Table 5.1: The transaction identifiers and the itemsets of the
transactions in $\mathcal{D}$.

tid   X          tid   X
1     {1}        9     {1,3}
2     {2}        10    {2,4}
3     {2}        11    {2,4}
4     {2}        12    {2,4}
5     {2}        13    {2,4}
6     {2}        14    {1,2,3}
7     {1,3}      15    {1,2,3}
8     {1,3}      16    {1,2,4}

Note that the transaction database $\mathcal{D}$ could be represented also
as a collection of weighted itemsets, i.e., as a collection

$\{\{1\}, \{2\}, \{1, 3\}, \{2, 4\}, \{1, 2, 3\}, \{1, 2, 4\}\} = \{1, 2, 13, 24, 123, 124\}$

of itemsets together with a weight function

$w = \{1 \mapsto 1, 2 \mapsto 5, 13 \mapsto 3, 24 \mapsto 4, 123 \mapsto 2, 124 \mapsto 1\}$.

The collection of itemsets representing the transaction database
$\mathcal{D}$ can be partitioned into two chains $\mathcal{C}_1 = \{1, 13, 123\}$ and
$\mathcal{C}_2 = \{2, 24, 124\}$. □

Each chain $\mathcal{C}$ of itemsets can be written as one itemset $X$ by
adding to each item in the itemsets of $\mathcal{C}$ the information about the
minimum rank of the itemset in $\mathcal{C}$ containing that item. That is, a
chain $\mathcal{C} = \{X_1, \ldots, X_n\}$ such that $X_1 \subset \ldots \subset X_n = A_1 \ldots A_m$ can
be written as

$$\mathcal{C} = \{A_1^{rank(A_1, \mathcal{C})}, \ldots, A_m^{rank(A_m, \mathcal{C})}\} = A_1^{rank(A_1, \mathcal{C})} \ldots A_m^{rank(A_m, \mathcal{C})}$$

where

$$rank(A, \mathcal{C}) = \min_{A \in X \in \mathcal{C}} rank(X, \mathcal{C})$$

for any item $A$. (Note that the superscripts corresponding to the
ranks serve also as separators of the items, i.e., no other separators
such as commas are needed.) Furthermore, if there are several
items $A_i$, $i \in I = \{i_1, \ldots, i_{|I|}\}$, with the same rank $rank(A_i, \mathcal{C}) = k$,
then we can write $\{A_i : i \in I\}^k = \{A_{i_1}, \ldots, A_{i_{|I|}}\}^k$ instead of
$A_{i_1}^k \ldots A_{i_{|I|}}^k$. The ranks can even be omitted in that case if the
items are ordered by their ranks.

The quality values of the itemsets in the chain $\mathcal{C}$ can be expressed
as a vector of length $|\mathcal{C}|$ where the $i$th position of the vector is the quality
value of the itemset with rank $i - 1$ in the chain. Also, if the
interestingness measure is known to be strictly increasing or strictly
decreasing with respect to the partial order, then the ranks can be
replaced by the quality values of the itemsets.

Example 5.10 (representing the itemset chains). The itemset
chains $\mathcal{C}_1$ and $\mathcal{C}_2$ of Example 5.9 can be written as

$\mathcal{C}_1 = 1^0 2^2 3^1$

and

$\mathcal{C}_2 = 1^2 2^0 4^1$

where the superscripts are the ranks. The whole transaction database
(neglecting the actual transaction identifiers) is determined if
also the weight vectors

$w(\mathcal{C}_1) = (1, 3, 2)$

and

$w(\mathcal{C}_2) = (5, 4, 1)$

associated to the chains $\mathcal{C}_1$ and $\mathcal{C}_2$ are given. □

From a chain represented as an itemset augmented with the
ranks of items in the chain, it is possible to construct the original
chain. Namely, the rank-$k$ itemset of the chain is

$\{A_i : rank(A_i, \mathcal{C}) \le k, 1 \le i \le m\}$.

This approach to representing pattern chains can be adapted to a
wide variety of different pattern classes such as sequences and graphs.
Besides making the pattern collection more compactly representable
and hopefully more understandable, this approach can also
compress the pattern collections.
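
The rank representation and its inverse can be sketched directly in Python. The function names encode_chain and decode_chain are illustrative assumptions; a chain is given as a list of itemsets ordered by strict inclusion, smallest first, so that the list index of an itemset is its rank.

```python
def encode_chain(chain):
    """Map each item to the rank of the smallest itemset of the chain containing it."""
    ranks = {}
    for rank, itemset in enumerate(chain):
        for item in itemset:
            ranks.setdefault(item, rank)   # keep the minimum rank of each item
    return ranks


def decode_chain(ranks, length):
    """Rebuild the chain: the rank-k itemset holds the items of rank at most k."""
    return [{item for item, r in ranks.items() if r <= k} for k in range(length)]


c1 = [{1}, {1, 3}, {1, 2, 3}]
c2 = [{2}, {2, 4}, {1, 2, 4}]

print(encode_chain(c1))                    # item -> rank for the first chain
print(decode_chain(encode_chain(c2), 3))   # the second chain, reconstructed
```

The length of the chain is passed to the decoder separately because the smallest itemset of a chain may be empty, in which case no item carries rank zero.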

Example 5.11 (condensation by itemset chains). Let an itemset
collection consist of the itemsets

$\{0\}, \{0, 1\}, \ldots, \{0, \ldots, n - 1\}, \{1, \ldots, n\}, \{2, \ldots, n\}, \ldots, \{n\}$.

The collection can be partitioned into two chains

$\mathcal{C}_1 = \{\{0\}, \{0, 1\}, \ldots, \{0, \ldots, n - 1\}\}$

and

$\mathcal{C}_2 = \{\{1, \ldots, n\}, \{2, \ldots, n\}, \ldots, \{n\}\}$.

The size of each chain is $O(n^2)$ items if they are represented explicitly
but only $O(n \log n)$ items if represented as itemsets augmented
with the item ranks, i.e., as

$\mathcal{C}_1 = \{0^0, 1^1, \ldots, (n - 1)^{n-1}\} = 0^0 1^1 \ldots (n - 1)^{n-1}$

and

$\mathcal{C}_2 = \{1^{n-1}, 2^{n-2}, \ldots, n^0\} = 1^{n-1} 2^{n-2} \ldots n^0$.

□

Example 5.12 (chain and antichain partitions in the course
completion database). To illustrate chain and antichain partitions,
let us again consider the course completion database (see
Subsection 2.2.1) and especially the 0.20-frequent closed itemsets
in it (see Example 2.9).

By Dilworth's Theorem (Theorem 5.1), each antichain in the
collection gives a lower bound for the minimum number of chains in 
any chain partition of the collection. As itemsets of each cardinality 
form an antichain, we know (see Table ESI) that there are at least 
638 chains in any chain partition of the collection of 0.20-frequent 
closed itemsets in the course completion database. 

The minimum number of chains in the collection is slightly higher,
namely 735. (That can be computed by summing the values of the
second column of Table 5.2, representing the numbers of chains of
different lengths.) The mode and median lengths of the chains are
both three.

The ten longest chains are shown in Table 5.3. (The eight chains of
length five are chosen arbitrarily from the 36 chains of length five.)
The columns of the table are as follows. The column $|\mathcal{C}|$ corresponds
to the lengths of the chains, the column $\mathcal{C}$ to the chains, and the
column $supp(\mathcal{C}, \mathcal{D})$ to the vectors representing the supports of the
itemsets in the chain.
Table 5.2: The number of chains of all non-zero lengths in the
minimum chain partition of the 0.20-frequent closed itemsets in the
course completion database.

the length of chain    the number of chains
1                      63
2                      189
3                      277
4                      168
5                      36
6                      2

The chains in Table 5.3 show one major problem of chaining by
(unweighted) bipartite matching: the quality values can differ quite



Table 5.3: Ten longest chains in the minimum chain partition of the
0.20-frequent closed itemsets in the course completion database.

$|\mathcal{C}|$   $\mathcal{C}$                                      $supp(\mathcal{C}, \mathcal{D})$
6     $12^0 2^1 15^2 13^3 0^4 \{3,5\}^5$      (763,739,616,558,523,520)
6     $7^0 10^1 3^2 5^3 13^4 2^5$             (1060,570,565,559,501,496)
5     $\{6,13\}^0 15^1 2^2 12^3 \{3,5\}^4$    (625,579,528,507,504)
5     $\{0,12\}^0 6^1 13^2 7^3 \{2,3,5\}^4$   (690,569,500,495,481)
5     $15^0 0^1 1^2 5^3 3^4$                  (748,678,539,497,491)
5     $\{2,5\}^0 15^1 0^2 7^3 \{6,12\}^4$     (992,666,608,575,499)
5     $0^0 10^1 2^2 1^3 3^4$                  (2076,788,692,526,510)
5     $\{2,3\}^0 6^1 2^2 7^3 13^4$            (1098,750,601,574,515)
5     $1^0 9^1 5^2 0^3 2^4$                   (1587,684,547,489,485)
5     $\{3,15\}^0 2^1 13^2 6^3 5^4$           (675,668,587,527,523)

much inside one chain. This problem can be slightly diminished by
using weighted bipartite matching where the weight of the edge depends
on how much the quality values of the corresponding itemsets
differ from each other. This ensures only that the sum of the differences
of the quality values of consecutive itemsets in the chains
is minimized. Thus, in long chains the minimum and the maximum
quality values can still differ considerably. A more heuristic approach
would be to further partition the obtained chains in such a
way that the quality values of any two itemsets in the same chain
do not differ too much from each other. Such partitions can be
computed efficiently for several loss functions using the techniques
described in Chapter 3. The minimality of the chain partition,
however, is sacrificed when the chains in the partition are further
partitioned.

A simple minimum antichain partition of the collection of 0.20- 
frequent itemsets in the course completion database is the partition 
of the itemsets by their cardinalities (see Table EHl)- Especially, the 
0.20-frequent items (Table form an antichain in the collection 
of 0.20-frequent itemsets in the database. The frequent items can 
be considered as a simple summary of the collection of all frequent 
itemsets and the underlying transaction database, too. 

Also the antichains can contain itemsets with very different qual- 
ity values. Again, this problem can be diminished by further par- 
titioning each antichain using the quality values of the patterns. 
□ 

We evaluated the condensation abilities of pattern chaining
experimentally by chaining closed $\sigma$-frequent itemsets of the IPUMS
Census and Internet Usage databases for several different minimum
frequency thresholds $\sigma \in [0, 1]$. We chained the itemsets optimally
by finding a maximum bipartite matching in the corresponding
bipartite graph (Algorithm 5.1) and in a greedy manner
(Algorithm 5.2) when the itemsets were ordered by their cardinalities.

As noticed in Example 5.5, the number of chains is bounded
above by the cardinality of the pattern collection and below by the
number of maximal patterns in the collection. In the case of closed
$\sigma$-frequent itemsets this means that the number of chains is never
greater than the number of closed $\sigma$-frequent itemsets and never
smaller than the number of maximal $\sigma$-frequent itemsets. Furthermore,
the lower bound given by the maximal itemsets might not be
very tight:

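Example 5.13 (slackness of lower bounds determined by
maximal itemsets). If the collection of closed $\sigma$-frequent itemsets
in $\mathcal{D}$ is

$\mathcal{FC}(\sigma, \mathcal{D}) = 2^{\mathcal{I}} = \{X \subseteq \mathcal{I}\}$

then the collection of maximal $\sigma$-frequent itemsets in $\mathcal{D}$ is

$\mathcal{FM}(\sigma, \mathcal{D}) = \{\mathcal{I}\}$

but the largest antichain $\mathcal{A}$ in $\mathcal{FC}(\sigma, \mathcal{D})$ consists of all itemsets of
cardinality $\lfloor |\mathcal{I}|/2 \rfloor$. Thus, the cardinality of $\mathcal{A}$ is

$$|\mathcal{A}| = \binom{|\mathcal{I}|}{\lfloor |\mathcal{I}|/2 \rfloor}.$$

□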




The chaining of closed $\sigma$-frequent itemsets was computed for
many different minimum frequency thresholds $\sigma \in [0, 1]$. The results
are shown in Figure 5.2 and in Figure 5.3. The upper figures
show the minimum frequency thresholds against the number of patterns.
Each curve corresponds to some class of patterns expressed
by the label of the curve. The lower figures show the minimum
frequency thresholds against the relative number of closed frequent
itemsets and chains with respect to the number of maximal frequent
itemsets.

In the experiments, the number of chains was smaller than the
number of closed frequent itemsets. Thus, the idea of finding a minimum
chain partition seems to be useful for condensation. It is also worth
remembering that the fundamental assumption in frequent itemset
mining is that not very large itemsets are frequent, since also all
subitemsets of the frequent itemsets are frequent. This implies that
the chains with respect to the partial order relation subset inclusion
cannot be very long, as the length of the longest chain in the
frequent itemset collection is the cardinality of the largest frequent
itemset. This observation makes the results even more satisfactory.

Even more interesting results were obtained when comparing
the minimal and the minimum chain partitions: the greedy
heuristic produced almost as small chain partitions as the
computationally much more demanding approach based on maximum
bipartite matchings. (Similar results were obtained also with all
other transaction databases we experimented with.) It is not clear,
however, whether the quality of the maximal matchings is specific to
closed frequent itemsets or if the results generalize to some other
pattern collections as well.

Figure 5.2: Pattern chains in Internet Usage data.

Figure 5.3: Pattern chains in IPUMS Census data.



CHAPTER 6 



Relating Patterns by Their 
Change Profiles 



To make pattern collections more understandable, it would often 
be useful to relate the patterns to each other. In Chapter [21 the 
relationships between patterns were determined by a partial order 
over the pattern collection. The patterns can be related to each 
other also by their quality values. For example, absolute or rela- 
tive differences between the quality values of the patterns could be 
used to measure their (dis)similarity. It is not immediate, however,
whether comparing the quality values of two patterns actually tells 
much about their similarity. 

An alternative approach is to measure the similarity between two 
patterns based on how they relate to other patterns. That is, the 
patterns are considered similar if they are related to other patterns 
similarly. This approach depends strongly on what it means to be 
related to other patterns. A simple solution is to consider how the 
quality value of the pattern has to be modified in order to obtain 
the quality values of its super- and subpatterns. 

Example 6.1 (modifying quality values). The two simplest examples
of modifications of $\phi(p)$ to $\phi(p')$ are multiplying the quality
value $\phi(p)$ by the value $\phi(p')/\phi(p)$, and adding to the quality value
$\phi(p)$ the value $\phi(p') - \phi(p)$.

In this chapter we restrict the modifications to the first case,
i.e., obtaining the quality value $\phi(p')$ of a pattern $p' \in \mathcal{P}$ from
the quality value $\phi(p)$ of a pattern $p \in \mathcal{P}$ by multiplying $\phi(p)$ by
$\phi(p')/\phi(p)$. □

These modifications for one pattern can be combined as a mapping
from the patterns to modifications. This mapping for a pattern
$p$ is called a change profile $ch^p$ of the pattern $p$, and each value
$ch^p(p')$ is called the change of $p$ with respect to $p' \in \mathcal{P}$. To simplify
the considerations, the change profile $ch^p$ is divided into two
parts (adapting the terminology of [Mit82, MT97]): the specializing
change profile $ch_s^p$ describes the changes to the superpatterns and
the generalizing change profile $ch_g^p$ describes the changes to the
subpatterns. When the type of the change profile is not of importance,
a change profile of $X$ is denoted by $ch^X$.

Example 6.2 (specializing change profiles for itemsets). Let 

us consider the transaction database V shown as Table UTTl 

Table 6.1: The transaction identifiers and the itemsets of the trans- 
actions in T>. 



tid 


X 


1 


{A} 


2 


{AC} 


3 


{A,B,C} 


4 


{B,C} 



The collection of 1/ 4-frequent itemsets in D and their frequencies 
are shown in Table HT^ 

Table 6.2: The frequent itemsets and their frequencies in $\mathcal{D}$.

    X     fr(X, D)
    ∅     1
    A     3/4
    B     1/2
    C     3/4
    AB    1/4
    AC    1/2
    BC    1/2
    ABC   1/4


For the itemsets and the frequencies, the changes in the
specializing change profiles are of the form

$$ch_s^X(Y) = \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}.$$

Thus, the specializing change profiles of the singleton itemsets
A, B and C of the 1/4-frequent itemsets in $\mathcal{D}$ are determined by
the changes

$$ch_s^A = \{B \mapsto 1/3, C \mapsto 2/3, BC \mapsto 1/3\} = \{B, BC \mapsto 1/3, C \mapsto 2/3\},$$
$$ch_s^B = \{A \mapsto 1/2, C \mapsto 1, AC \mapsto 1/2\} = \{A, AC \mapsto 1/2, C \mapsto 1\} \text{ and}$$
$$ch_s^C = \{A \mapsto 2/3, B \mapsto 2/3, AB \mapsto 1/3\} = \{A, B \mapsto 2/3, AB \mapsto 1/3\}.$$

□



The change profiles attempt to reach from a local description of
data, i.e., a pattern collection, to a more global view, i.e., to
relationships between the patterns in the collection. The change
profiles can be used to define similarity measures between the
patterns, to score the patterns, and also in the condensed
representations of pattern collections.

In this chapter, we introduce the concept of change profiles, a
new representation of pattern collections that aims to bridge the
gap between local and global descriptions of data. We describe
several variants of change profiles and study their properties. We
consider different approaches to clustering change profiles and show
that the clustering problems are NP-hard and inapproximable for a wide
variety of dissimilarity functions for change profiles, but that in
practice change profiles can be used to provide reasonable clusterings.
Furthermore, we suggest representing a pattern collection using
approximate change profiles and propose algorithms to estimate the
quality values from the approximate change profiles.

This chapter is based on the article "Change Profiles" [Mie03b].
In the remainder of the chapter we shall focus on frequent itemsets;
change profiles can readily be generalized to arbitrary pattern
collections with a partial order.

6.1 From Association Rules to Change Profiles

The frequency $fr(X, \mathcal{D})$ of a frequent itemset $X$ in a transaction
database $\mathcal{D}$ can be interpreted as the probability $P(X)$ of the event
"a transaction drawn randomly from the transaction database $\mathcal{D}$
contains itemset $X$" and the accuracy $acc(X \Rightarrow Y, \mathcal{D})$ of an
association rule $X \Rightarrow Y$ as the conditional probability $P(Y|X)$. Thus,
each association rule $X \Rightarrow Y$ describes one relationship of the
itemset $X$ to other itemsets. (Empirical conditional probabilities of
other kinds of events have also been studied in data mining under the
name of cubegrades [IKA02].)

A more global view of the relationships between the frequent
itemset $X$ and other frequent itemsets can be obtained by combining
the association rules $X \Rightarrow Y$ with a common body into a mapping
from the frequent itemsets to the interval $[0, 1]$. This mapping is
called a specializing change profile:

Definition 6.1 (specializing change profiles). A specializing
change profile of a $\sigma$-frequent itemset $X$ in $\mathcal{D}$ is a mapping

$$ch_s^X : \{Y \subseteq \mathcal{I} : X \cup Y \in \mathcal{F}(\sigma, \mathcal{D})\} \to [0, 1]$$

consisting of the accuracies of the $\sigma$-frequent rules $X \Rightarrow Y$ in $\mathcal{D}$, i.e.,

$$ch_s^X(Y) = \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}$$

where $X \cup Y \in \mathcal{F}(\sigma, \mathcal{D})$.

A specializing change profile $ch_s^X$ can be interpreted as the
conditional probability $P(Y|X)$ where $Y$ is a random variable.
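
Definition 6.1 can be checked mechanically on the database of Table 6.1. The following is a minimal sketch, assuming a brute-force enumeration of itemsets that is only sensible for tiny examples; the helper names `fr`, `subsets`, `frequent_itemsets` and `specializing_profile` are illustrative and not part of the original text.

```python
from fractions import Fraction
from itertools import combinations

# Transaction database of Table 6.1, one frozenset of items per transaction.
DB = [frozenset("A"), frozenset("AC"), frozenset("ABC"), frozenset("BC")]


def fr(itemset, db=DB):
    """Fraction of the transactions in db that contain every item of itemset."""
    s = frozenset(itemset)
    return Fraction(sum(1 for t in db if s <= t), len(db))


def subsets(items):
    """All subsets of items, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def frequent_itemsets(db, sigma):
    """Brute-force F(sigma, D) as a dict: itemset -> frequency."""
    all_items = set().union(*db)
    return {x: fr(x, db) for x in subsets(all_items) if fr(x, db) >= sigma}


def specializing_profile(x, freq):
    """ch_s^X: maps each Y with X | Y frequent to fr(X | Y) / fr(X)."""
    x = frozenset(x)
    all_items = set().union(*freq)
    return {y: freq[x | y] / freq[x]
            for y in subsets(all_items) if x | y in freq}


F = frequent_itemsets(DB, Fraction(1, 4))
ch_A = specializing_profile("A", F)
print(ch_A[frozenset("C")])  # the change fr(AC)/fr(A)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the computed changes can be compared directly with the values given in the examples.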

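Example 6.3 (specializing change profiles). Let us consider
the collection $\mathcal{F}(\sigma, \mathcal{D})$ of the $\sigma$-frequent itemsets in $\mathcal{D}$ with
$\sigma = 1/4$ where $\mathcal{D}$ is as shown in Table 6.1 and the $\sigma$-frequent itemsets
and their frequencies as shown in Table 6.2. Then the specializing
change profiles of $\mathcal{F}(\sigma, \mathcal{D})$ are

$$ch_s^{\emptyset} = \{\emptyset \mapsto 1, A, C \mapsto 3/4, B, AC, BC \mapsto 1/2, AB, ABC \mapsto 1/4\},$$
$$ch_s^{A} = \{\emptyset, A \mapsto 1, C, AC \mapsto 2/3, B, AB, BC, ABC \mapsto 1/3\},$$
$$ch_s^{B} = \{\emptyset, B, C, BC \mapsto 1, A, AB, AC, ABC \mapsto 1/2\},$$
$$ch_s^{C} = \{\emptyset, C \mapsto 1, A, B, AC, BC \mapsto 2/3, AB, ABC \mapsto 1/3\},$$
$$ch_s^{AB} = \{\emptyset, A, B, C, AB, AC, BC, ABC \mapsto 1\},$$
$$ch_s^{AC} = \{\emptyset, A, C, AC \mapsto 1, B, AB, BC, ABC \mapsto 1/2\},$$
$$ch_s^{BC} = \{\emptyset, B, C, BC \mapsto 1, A, AB, AC, ABC \mapsto 1/2\} \text{ and}$$
$$ch_s^{ABC} = \{\emptyset, A, B, C, AB, AC, BC, ABC \mapsto 1\}.$$

□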



Similarly to the specializing change profiles, we can define a
change profile to describe how the frequency of a $\sigma$-frequent itemset
$X$ changes when some items are removed from it. A change profile
of this kind is called a generalizing change profile:

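Definition 6.2 (generalizing change profiles). A generalizing
change profile of a $\sigma$-frequent itemset $X$ in $\mathcal{D}$ is a mapping

$$ch_g^X : \mathcal{F}(\sigma, \mathcal{D}) \to \left[1, \frac{1}{\sigma}\right]$$

consisting of the inverse accuracies of the frequent rules $X \setminus Y \Rightarrow X$,
i.e., $ch_g^X(Y) = fr(X \setminus Y, \mathcal{D}) / fr(X, \mathcal{D})$, where $Y \subseteq \mathcal{I}$.

The generalizing change profile $ch_g^X$ corresponds to the mapping
$1/P(X \mid X \setminus Y)$ where $Y$ is a random variable.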
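
The generalizing case can be computed in the same way. The sketch below assumes the frequencies of Table 6.2 entered as a literal dictionary (itemsets written as strings of items); the names `FREQ` and `generalizing_profile` are illustrative only.

```python
from fractions import Fraction

# Frequencies of the 1/4-frequent itemsets of Table 6.2.
FREQ = {frozenset(k): Fraction(v) for k, v in {
    "": "1", "A": "3/4", "B": "1/2", "C": "3/4",
    "AB": "1/4", "AC": "1/2", "BC": "1/2", "ABC": "1/4"}.items()}


def generalizing_profile(x, freq=FREQ):
    """ch_g^X: maps each frequent Y to fr(X minus Y) / fr(X).

    X minus Y is a subset of X, so it is frequent whenever X is.
    """
    x = frozenset(x)
    return {y: freq[x - y] / freq[x] for y in freq}


ch_g_AB = generalizing_profile("AB")
for y in sorted(ch_g_AB, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(y)) or "{}", ch_g_AB[y])
```

Every value is at least 1 and at most $1/\sigma = 4$, in agreement with the range stated in Definition 6.2.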

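Example 6.4 (generalizing change profiles). Let us consider
the collection $\mathcal{F}(\sigma, \mathcal{D})$ of the $\sigma$-frequent itemsets in $\mathcal{D}$ with
$\sigma = 1/4$ where $\mathcal{D}$ is as shown in Table 6.1 and the $\sigma$-frequent itemsets
and their frequencies as shown in Table 6.2. Then the generalizing
change profiles in $\mathcal{F}(\sigma, \mathcal{D})$ are

$$ch_g^{\emptyset} = \{\emptyset, A, B, C, AB, AC, BC, ABC \mapsto 1\},$$
$$ch_g^{A} = \{\emptyset, B, C, BC \mapsto 1, A, AB, AC, ABC \mapsto 4/3\},$$
$$ch_g^{B} = \{\emptyset, A, C, AC \mapsto 1, B, AB, BC, ABC \mapsto 2\},$$
$$ch_g^{C} = \{\emptyset, A, B, AB \mapsto 1, C, AC, BC, ABC \mapsto 4/3\},$$
$$ch_g^{AB} = \{\emptyset, C \mapsto 1, A, AC \mapsto 2, B, BC \mapsto 3, AB, ABC \mapsto 4\},$$
$$ch_g^{AC} = \{\emptyset, B \mapsto 1, A, C, AB, BC \mapsto 3/2, AC, ABC \mapsto 2\},$$
$$ch_g^{BC} = \{\emptyset, A, C, AC \mapsto 1, B, AB \mapsto 3/2, BC, ABC \mapsto 2\} \text{ and}$$
$$ch_g^{ABC} = \{\emptyset, C \mapsto 1, A, B, AC \mapsto 2, AB, BC \mapsto 3, ABC \mapsto 4\}.$$

□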

The specializing and generalizing change profiles $ch_s^X$ and $ch_g^X$
describe the upper and lower neighborhoods

$$N_s(X) = \{X \cup Y \in \mathcal{F}(\sigma, \mathcal{D}) : Y \subseteq \mathcal{I}\}$$

and

$$N_g(X) = \{X \setminus Y \in \mathcal{F}(\sigma, \mathcal{D}) : Y \subseteq \mathcal{I}\}$$

of the frequent itemset $X$ in the collection $\mathcal{F}(\sigma, \mathcal{D})$, respectively.
The neighborhood

$$N(X) = N_s(X) \cup N_g(X)$$

of $X$ consists of the frequent itemsets $Y \in N_s(X)$ that contain the
frequent itemset $X$ and the frequent itemsets $X \setminus Y \in N_g(X)$ that
are contained in $X$, i.e., the frequent itemsets that are comparable
with $X$.

As seen in Example 6.3 and Example 6.4, the change profiles
(Definition 6.1 and Definition 6.2) are often highly redundant. This
is due to the following properties of itemsets:

Observation 6.1. Let $X, Y \subseteq \mathcal{I}$. Then

$$X \cup Y = X \cup (Y \setminus X)$$

and

$$X \setminus Y = X \setminus (Y \cap X).$$

The number of defined values of the change profile is reduced
considerably (without losing any information) by exploiting
Observation 6.1.
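
As a short worked step (a restatement of the observation in terms of the change profiles, not an addition to the original argument), substituting the two identities into the definitions gives:

```latex
\begin{align*}
ch_s^X(Y) &= \frac{fr(X \cup Y, \mathcal{D})}{fr(X, \mathcal{D})}
           = \frac{fr(X \cup (Y \setminus X), \mathcal{D})}{fr(X, \mathcal{D})}
           = ch_s^X(Y \setminus X), \\
ch_g^X(Y) &= \frac{fr(X \setminus Y, \mathcal{D})}{fr(X, \mathcal{D})}
           = \frac{fr(X \setminus (Y \cap X), \mathcal{D})}{fr(X, \mathcal{D})}
           = ch_g^X(Y \cap X).
\end{align*}
```

For instance, with the frequencies of Table 6.2, $ch_s^A(AC) = ch_s^A(C) = 2/3$ and $ch_g^{AB}(AC) = ch_g^{AB}(A) = 2$.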

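Example 6.5 (redundancy in change profiles). Let $X$ be a
frequent itemset with only one frequent superitemset $X \cup \{A\}$ where
$A \notin X$. There are $2^{|X|+1}$ subitemsets of $X \cup \{A\}$. The first equation
in Observation 6.1 implies that the frequency of $X \cup Y$ is equal to the
frequency of $X$ if $Y \subseteq X$. Thus, the specializing changes $ch_s^X(Y) = 1$
for all $Y \subseteq X$ can be neglected. Furthermore, the specializing
changes $ch_s^X(Y \cup \{A\})$ are equal for all $Y \subseteq X$ and it is sufficient
to store just the specializing change $ch_s^X(\{A\})$. This reduces the
size of the specializing change profile of $X$ by factor $2^{|X|+1}$.

Let $X$ be an arbitrary frequent itemset in $\mathcal{F}(\sigma, \mathcal{D})$. Based
on the second equation of Observation 6.1 there is no need to store
the changes $ch_g^X(Y)$ for $Y \subseteq \mathcal{I}$ such that $Y \not\subseteq X$, since they are
equal to the changes $ch_g^X(Y \cap X)$. This reduces the number of changes
in the generalizing change profile of $X$ by factor
$2^{|\mathcal{I}| - |X|} = 2^{|\mathcal{I} \setminus X|}$. □

The change profiles with redundancy reduced as in Example 6.5
are called concise change profiles:

Definition 6.3 (concise specializing change profiles). A concise
specializing change profile $cch_s^X$ is a restriction of a specializing
change profile $ch_s^X$ to itemsets $Y$ such that $X \cap Y = \emptyset$ and
$X \cup Y \in \mathcal{F}(\sigma, \mathcal{D})$.

Definition 6.4 (concise generalizing change profiles). A concise
generalizing change profile $cch_g^X$ is a restriction of a generalizing
change profile $ch_g^X$ to itemsets $Y$ such that $Y \subseteq X$.

Example 6.6 (concise change profiles). Let us consider the
collection $\mathcal{F}(\sigma, \mathcal{D})$ of the $\sigma$-frequent itemsets in $\mathcal{D}$ with $\sigma = 1/4$
where $\mathcal{D}$ is as shown in Table 6.1 and the $\sigma$-frequent itemsets and
their frequencies as shown in Table 6.2. The concise specializing
change profiles of $\mathcal{F}(\sigma, \mathcal{D})$ are

$$cch_s^{\emptyset} = \{\emptyset \mapsto 1, A, C \mapsto 3/4, B, AC, BC \mapsto 1/2, AB, ABC \mapsto 1/4\},$$
$$cch_s^{A} = \{\emptyset \mapsto 1, C \mapsto 2/3, B, BC \mapsto 1/3\},$$
$$cch_s^{B} = \{\emptyset, C \mapsto 1, A, AC \mapsto 1/2\},$$
$$cch_s^{C} = \{\emptyset \mapsto 1, A, B \mapsto 2/3, AB \mapsto 1/3\},$$
$$cch_s^{AB} = \{\emptyset, C \mapsto 1\},$$
$$cch_s^{AC} = \{\emptyset \mapsto 1, B \mapsto 1/2\},$$
$$cch_s^{BC} = \{\emptyset \mapsto 1, A \mapsto 1/2\} \text{ and}$$
$$cch_s^{ABC} = \{\emptyset \mapsto 1\}$$

and the concise generalizing change profiles of $\mathcal{F}(\sigma, \mathcal{D})$ are

$$cch_g^{\emptyset} = \{\emptyset \mapsto 1\},$$
$$cch_g^{A} = \{\emptyset \mapsto 1, A \mapsto 4/3\},$$
$$cch_g^{B} = \{\emptyset \mapsto 1, B \mapsto 2\},$$
$$cch_g^{C} = \{\emptyset \mapsto 1, C \mapsto 4/3\},$$
$$cch_g^{AB} = \{\emptyset \mapsto 1, A \mapsto 2, B \mapsto 3, AB \mapsto 4\},$$
$$cch_g^{AC} = \{\emptyset \mapsto 1, A, C \mapsto 3/2, AC \mapsto 2\},$$
$$cch_g^{BC} = \{\emptyset, C \mapsto 1, B \mapsto 3/2, BC \mapsto 2\} \text{ and}$$
$$cch_g^{ABC} = \{\emptyset, C \mapsto 1, A, B, AC \mapsto 2, AB, BC \mapsto 3, ABC \mapsto 4\}.$$

□

The concise change profiles can be interpreted as affine
axis-parallel subspaces of $\mathbb{R}^{|\mathcal{F}(\sigma, \mathcal{D})|}$ (i.e., affine hyperplanes in
$\mathbb{R}^{|\mathcal{F}(\sigma, \mathcal{D})|}$) that are indexed

• by itemsets $Y$ such that $X \cap Y = \emptyset$ and $X \cup Y \in \mathcal{F}(\sigma, \mathcal{D})$ in
the specializing case, and

• by itemsets $Y$ such that $Y \subseteq X$ in the generalizing case.

The concise change profiles for a frequent itemset collection
$\mathcal{F}(\sigma, \mathcal{D})$ can be computed efficiently by Algorithm 6.1.

Algorithm 6.1 Generation of concise change profiles.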
Input: The collection $\mathcal{F}(\sigma, \mathcal{D})$ of $\sigma$-frequent itemsets in a
    transaction database $\mathcal{D}$ and their frequencies.
Output: The concise specializing change profiles and the concise
    generalizing change profiles of $\mathcal{F}(\sigma, \mathcal{D})$.
function Change-Profiles($\mathcal{F}(\sigma, \mathcal{D})$, $fr$)
    for all $X \in \mathcal{F}(\sigma, \mathcal{D})$ do
        for all $Y \subseteq X$ do
            $cch_s^{X \setminus Y}(Y) \leftarrow fr(X, \mathcal{D}) / fr(X \setminus Y, \mathcal{D})$
            $cch_g^{X}(Y) \leftarrow fr(X \setminus Y, \mathcal{D}) / fr(X, \mathcal{D})$
        end for
    end for
    return $\langle cch_s, cch_g \rangle$
end function
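
Algorithm 6.1 translates almost line by line into executable code. The following Python sketch is one possible rendering, assuming the frequent itemsets are given as a dictionary from itemsets to frequencies (here the contents of Table 6.2); the names `FREQ` and `change_profiles` are illustrative and not taken from the original.

```python
from fractions import Fraction
from itertools import combinations

# Frequencies of the 1/4-frequent itemsets of Table 6.2.
FREQ = {frozenset(k): Fraction(v) for k, v in {
    "": "1", "A": "3/4", "B": "1/2", "C": "3/4",
    "AB": "1/4", "AC": "1/2", "BC": "1/2", "ABC": "1/4"}.items()}


def change_profiles(freq):
    """Concise specializing and generalizing change profiles of freq.

    freq must be downward closed (every subset of a key is a key),
    which holds for any collection of sigma-frequent itemsets.
    """
    cch_s = {x: {} for x in freq}
    cch_g = {x: {} for x in freq}
    for x, fx in freq.items():
        for r in range(len(x) + 1):
            for y in map(frozenset, combinations(sorted(x), r)):
                rest = x - y
                # X is a superset of rest, reached by adding Y to rest.
                cch_s[rest][y] = fx / freq[rest]
                # rest is a subset of X, reached by removing Y from X.
                cch_g[x][y] = freq[rest] / fx
    return cch_s, cch_g


cch_s, cch_g = change_profiles(FREQ)
print(cch_s[frozenset("A")])
print(cch_g[frozenset("AB")])
```

The loop visits each pair $(X, Y)$ with $Y \subseteq X$ exactly once, so the running time is proportional to the total size of the concise generalizing change profiles.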



As shown in Example 6.7, the neighborhoods of even the concise
change profiles can be too large.

Example 6.7 (redundancy in concise change profiles). Let
$X$ be an itemset in the collection $\mathcal{F}(\sigma, \mathcal{D})$. Then
$|cch_s^{\emptyset}| \geq 2^{|X|}$ and $|cch_g^{X}| \geq 2^{|X|}$ in $\mathcal{F}(\sigma, \mathcal{D})$. □

Thus, following the definitions of association rules, we define
simple specializing change profiles and simple generalizing change
profiles:

Definition 6.5 (simple change profiles). A simple specializing
(generalizing) change profile $sch_s^X$ ($sch_g^X$) is the restriction of
$cch_s^X$ ($cch_g^X$) to singleton itemsets $Y$.
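
Because a simple change profile has at most one entry per item, it can be computed directly from the frequencies. The sketch below assumes the frequencies of Table 6.2 as a literal dictionary and indexes the profiles by items rather than by singleton itemsets; `FREQ` and `simple_profiles` are illustrative names.

```python
from fractions import Fraction

# Frequencies of the 1/4-frequent itemsets of Table 6.2.
FREQ = {frozenset(k): Fraction(v) for k, v in {
    "": "1", "A": "3/4", "B": "1/2", "C": "3/4",
    "AB": "1/4", "AC": "1/2", "BC": "1/2", "ABC": "1/4"}.items()}


def simple_profiles(x, freq=FREQ):
    """Return (sch_s^X, sch_g^X) as dicts from single items to changes."""
    x = frozenset(x)
    items = set().union(*freq)
    # Specializing: add one item outside X, if the result is still frequent.
    sch_s = {a: freq[x | {a}] / freq[x]
             for a in sorted(items - x) if x | {a} in freq}
    # Generalizing: remove one item of X.
    sch_g = {a: freq[x - {a}] / freq[x] for a in sorted(x)}
    return sch_s, sch_g


print(simple_profiles("A"))
print(simple_profiles("ABC"))
```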

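Example 6.8 (simple change profiles). Let us consider the
collection $\mathcal{F}(\sigma, \mathcal{D})$ of the $\sigma$-frequent itemsets in $\mathcal{D}$ with $\sigma = 1/4$
where $\mathcal{D}$ is as shown in Table 6.1 and the $\sigma$-frequent itemsets and their
frequencies as shown in Table 6.2. The simple specializing change
profiles of $\mathcal{F}(\sigma, \mathcal{D})$ are

$$sch_s^{\emptyset} = \{A, C \mapsto 3/4, B \mapsto 1/2\},$$
$$sch_s^{A} = \{C \mapsto 2/3, B \mapsto 1/3\},$$
$$sch_s^{B} = \{C \mapsto 1, A \mapsto 1/2\},$$
$$sch_s^{C} = \{A, B \mapsto 2/3\},$$
$$sch_s^{AB} = \{C \mapsto 1\},$$
$$sch_s^{AC} = \{B \mapsto 1/2\},$$
$$sch_s^{BC} = \{A \mapsto 1/2\} \text{ and}$$
$$sch_s^{ABC} = \{\}$$

and the simple generalizing change profiles of $\mathcal{F}(\sigma, \mathcal{D})$ are

$$sch_g^{\emptyset} = \{\},$$
$$sch_g^{A} = \{A \mapsto 4/3\},$$
$$sch_g^{B} = \{B \mapsto 2\},$$
$$sch_g^{C} = \{C \mapsto 4/3\},$$
$$sch_g^{AB} = \{A \mapsto 2, B \mapsto 3\},$$
$$sch_g^{AC} = \{A, C \mapsto 3/2\},$$
$$sch_g^{BC} = \{C \mapsto 1, B \mapsto 3/2\} \text{ and}$$
$$sch_g^{ABC} = \{C \mapsto 1, A, B \mapsto 2\}.$$

□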



The number of bits needed for representing a simple change
profile is at most $|\mathcal{I}| \log |\mathcal{D}|$: Each change profile can be described as
a length-$|\mathcal{I}|$ vector of changes as the number of singleton subsets
of the set $\mathcal{I}$ of items is $|\mathcal{I}|$. Each change can be described using at
most $\log |\mathcal{D}|$ bits since there are at most as many different possible
changes from a given itemset to any other itemset as there are
transactions in $\mathcal{D}$. This upper bound can sometimes be quite loose
as shown by Example 6.9.

Example 6.9 (loose upper bounds for simple generalizing
change profiles). Let the set $\mathcal{I}$ of items be large. Then the above
upper bound is often very loose: The number of itemsets in the
collection $\mathcal{F}(\sigma, \mathcal{D})$ of $\sigma$-frequent itemsets in $\mathcal{D}$ is exponential in the
cardinality of the largest itemset in $\mathcal{F}(\sigma, \mathcal{D})$. Thus, the largest
itemset $X$ in $\mathcal{F}(\sigma, \mathcal{D})$ has to be moderately small in order to be able
to represent the collection $\mathcal{F}(\sigma, \mathcal{D})$ in a reasonable space. Thus,
in this case, the upper bound for the binary description of a simple
generalizing change profile of an itemset $X \in \mathcal{F}(\sigma, \mathcal{D})$ should rather
be $|X| (\log |\mathcal{I}|) (\log |\mathcal{D}|)$. □



6.2 Clustering the Change Profiles 

In order to be able to find groups of similar change profiles, it
would be useful to be able to measure the (similarity or)
dissimilarity between change profiles $ch^X$ and $ch^Y$.

The dissimilarity between the change profiles $ch^X$ and $ch^Y$ can
be defined to be their distance in their common domain
$Dom(ch^X) \cap Dom(ch^Y)$ with respect to some distance function $d$. A
complementary approach would be to focus on the differences in the
structure of the pattern collection, e.g., to measure the difference
between two change profiles by computing the symmetric difference of
their domains. This kind of dissimilarity function concentrates solely
on the structure of the pattern collection and thus neglects the
frequencies. A sophisticated dissimilarity should probably combine
both points of view.

We shall focus on the first one. The only requirements we have
for a distance function are given by Definition 6.6.

Definition 6.6 (a distance function). A function $d$ is a distance
function if

$$d(ch^X, ch^Y) = 0 \iff ch^X(Z) = ch^Y(Z)$$

holds for all $Z \in Dom(ch^X) \cap Dom(ch^Y)$.
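
One distance function satisfying Definition 6.6 is the sum of absolute differences over the common domain, which is also the choice made in Example 6.10. The sketch below assumes change profiles stored as dictionaries; the function name `common_domain_distance` is illustrative.

```python
from fractions import Fraction


def common_domain_distance(p, q):
    """Sum of |p(Z) - q(Z)| over the Z on which both profiles are defined.

    Profiles with disjoint domains get distance 0.
    """
    return sum((abs(p[z] - q[z]) for z in p.keys() & q.keys()), Fraction(0))


# Simple specializing change profiles of A, B and C from Example 6.8.
sch_A = {"B": Fraction(1, 3), "C": Fraction(2, 3)}
sch_B = {"A": Fraction(1, 2), "C": Fraction(1)}
sch_C = {"A": Fraction(2, 3), "B": Fraction(2, 3)}

print(common_domain_distance(sch_A, sch_B))
print(common_domain_distance(sch_B, sch_C))
```

The distance is zero exactly when the two profiles agree on every itemset of their common domain, since each term of the sum is non-negative.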

There are several ways to define what is a good clustering and
each approach has its own strengths and weaknesses [EC02, Kle02].
A simple way to group the change profiles based on a dissimilarity
function defined in their (pairwise) common domains is to allow
two change profiles $ch^X$ and $ch^Y$ to be in the same group only if
$d(ch^X, ch^Y) = 0$. Thus, the problem can be formulated as follows.






Problem 6.1 (change profile packing). Given a collection $Ch$
of change profiles and a dissimilarity function $d$, find a partition of
$Ch$ into groups $Ch_1, \ldots, Ch_k$ with the smallest possible $k$ such that
$d(ch^X, ch^Y) = 0$ holds for all $ch^X, ch^Y \in Ch_i$ with $1 \leq i \leq k$.

Unfortunately, the problem seems to be very difficult. Namely,
it can be shown to be at least as difficult as the minimum graph
coloring problem:

Problem 6.2 (minimum graph coloring [ACK+99]). Given a
graph $G = (V, E)$, find a labeling $label : V \to \mathbb{N}$ of the vertices with
the smallest number $|label(V)|$ of different labels such that if $u, v \in V$
are adjacent then $label(u) \neq label(v)$.

Theorem 6.1. The change profile packing problem is at least as
hard as the minimum graph coloring problem.

Proof. Let $G = (V, E)$ be an instance of the minimum graph
coloring problem where $V = \{v_1, \ldots, v_n\}$ is the set of vertices and
$E = \{e_1, \ldots, e_m\}$ is the set of edges.

We reduce the minimum graph coloring problem (Problem 6.2)
to the change profile packing problem (Problem 6.1) by first
constructing an instance $(\sigma, \mathcal{D})$ of the frequent itemset mining problem
and then showing that the collection $Ch$ of specializing change
profiles computed from the collection $\mathcal{F}(\sigma, \mathcal{D})$ of the $\sigma$-frequent
itemsets in $\mathcal{D}$ and their frequencies can be partitioned into $k + 2$
subcollections $Ch_1, \ldots, Ch_{k+2}$ if and only if the graph $G$ is $k$-colorable.
To simplify the description, we shall consider, without loss of
generality, simple change profiles instead of change profiles in general.

The set $\mathcal{I}$ of items consists of the elements in $V \cup E$. For each
vertex $v_i$ there are $3n$ transactions with transaction identifiers
$(i, 1), \ldots, (i, 3n)$. Thus, in total there are $3n^2$ transactions in $\mathcal{D}$.

Each transaction $((i, j), X)$ contains the vertex $v_i \in V$.
Transactions $((i, 3(j-1)+1), X)$ and $((i, 3(j-1)+2), X)$ contain an
edge $\{v_i, v_j\} \in E$ if and only if $i < j$. The transaction $((i, 3j), X)$
contains an edge $\{v_i, v_j\} \in E$ if $i > j$.

Let the minimum frequency threshold $\sigma$ be $1/(3n^2)$. Then
the collection $\mathcal{F}(\sigma, \mathcal{D})$ consists of the empty itemset $\emptyset$, the
singleton itemsets $\{v_1\}, \ldots, \{v_n\}, \{e_1\}, \ldots, \{e_m\}$ and the 2-itemsets
$\{v_i, e\}$ where $v_i \in e \in E$. Thus, the cardinality of $\mathcal{F}(\sigma, \mathcal{D})$ is
polynomial in the number of vertices of $G$. The simple change profiles
of $\mathcal{F}(\sigma, \mathcal{D})$ are the following ones:

$$sch_s^{\emptyset}(x) = \begin{cases} 1/n & \text{if } x \in V \\ 1/n^2 & \text{if } x \in E, \end{cases}$$

$$sch_s^{e}(v) = \begin{cases} 2/3 & \text{if } v = v_i \\ 1/3 & \text{if } v = v_j \end{cases} \quad \text{for } v \in e = \{v_i, v_j\},\ i < j, \text{ and}$$

$$sch_s^{v_i}(\{v_i, v_j\}) = \begin{cases} 2/(3n) & \text{if } i < j, \{v_i, v_j\} \in E \\ 1/(3n) & \text{if } i > j, \{v_i, v_j\} \in E. \end{cases}$$

Clearly,

• $d(sch_s^{\emptyset}, sch_s^{v}) > 0$ for $sch_s^{\emptyset}$ and all $sch_s^{v}$ where $v \in V$,

• $d(sch_s^{\emptyset}, sch_s^{e}) > 0$ for $sch_s^{\emptyset}$ and all $sch_s^{e}$ where $e \in E$, and

• $d(sch_s^{v}, sch_s^{e}) > 0$ for all $sch_s^{v}$ and $sch_s^{e}$ where $v \in V$ and
$e \in E$.

On one hand, no two of $sch_s^{\emptyset}$, $sch_s^{v}$ and $sch_s^{e}$ can be in the same
group for any $v \in V, e \in E$. On the other hand, all $sch_s^{e}$ can be
packed into one set $Ch_{k+1}$ and $sch_s^{\emptyset}$ always needs its own set $Ch_{k+2}$.

Hence, it is sufficient to show that the simple specializing change
profiles $sch_s^{v}$ can be partitioned into $k$ sets $Ch_1, \ldots, Ch_k$ without
any error if and only if the graph $G$ is $k$-colorable. No two simple
specializing change profiles $sch_s^{v_i}$ and $sch_s^{v_j}$ with $\{v_i, v_j\} \in E$
can be in the same group since $sch_s^{v_i}(\{v_i, v_j\}) \neq sch_s^{v_j}(\{v_i, v_j\})$. If
$\{v_i, v_j\} \notin E$ then $Dom(sch_s^{v_i}) \cap Dom(sch_s^{v_j}) = \emptyset$, i.e., $sch_s^{v_i}$ and
$sch_s^{v_j}$ can be in the same group.

As the minimum graph coloring problem can be mapped to the
change profile packing for specializing change profiles in polynomial
time, the latter is at least as hard as the minimum graph coloring
problem. □

The minimum graph coloring problem is hard to approximate
within $|V|^{1-\epsilon}$ for any $\epsilon > 0$ unless NP=ZPP [FK98]. (Recall that
the complexity class ZPP consists of the decision problems that
have randomized algorithms that always make the right decision
and run in expected polynomial time [Pap95].) Assuming that the
graph is connected we get from the above mapping from graphs to
change profiles the following rough upper bound

$$|Ch| = 1 + |V| + |E| \leq 1 + |V| + \binom{|V|}{2} = O(|V|^2).$$

Therefore, the change profile packing problem is hard to approximate
within $\Omega(|Ch|^{1/2-\epsilon})$ for any $\epsilon > 0$ unless NP=ZPP.
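
The transaction database used in the proof of Theorem 6.1 can be generated and inspected for small graphs. The sketch below is an illustration under stated assumptions: vertices are numbered $1, \ldots, n$, edges are given as pairs $(i, j)$ with $i < j$, and items are encoded as tuples; `reduction_database` and `fr` are illustrative names.

```python
from fractions import Fraction


def reduction_database(n, edges):
    """Transactions of the proof's construction for a graph on vertices 1..n.

    edges is a set of pairs (i, j) with i < j. Items are ("v", i) for
    vertices and ("e", i, j) for edges.
    """
    db = []
    for i in range(1, n + 1):
        for t in range(1, 3 * n + 1):      # transaction identifier (i, t)
            j, slot = divmod(t - 1, 3)
            j += 1                         # t is 3(j-1)+1, 3(j-1)+2 or 3j
            items = {("v", i)}
            if slot < 2 and i < j and (i, j) in edges:
                items.add(("e", i, j))     # two transactions when i < j
            if slot == 2 and i > j and (j, i) in edges:
                items.add(("e", j, i))     # one transaction when i > j
            db.append(frozenset(items))
    return db


def fr(itemset, db):
    """Fraction of the transactions in db that contain every item of itemset."""
    s = frozenset(itemset)
    return Fraction(sum(1 for t in db if s <= t), len(db))


n, edges = 3, {(1, 2), (2, 3), (1, 3)}     # a triangle
db = reduction_database(n, edges)
v1, v2, e12 = ("v", 1), ("v", 2), ("e", 1, 2)
print(fr([v1], db), fr([e12], db))
print(fr([v1, e12], db) / fr([v1], db), fr([v2, e12], db) / fr([v2], db))
```

For the edge $\{v_1, v_2\}$ the two endpoints obtain different changes, which is what prevents adjacent vertices from being packed into the same group.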

Although the inapproximability results seem to be devastating,
there are efficient heuristics, such as the first-fit and the best-fit
heuristics [CJCG+02], that might be able to find sufficiently good
partitions efficiently. However, the usefulness of such heuristics
depends on the actual transaction databases inducing the collections
of frequent itemsets.
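
A first-fit heuristic for the change profile packing problem can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the cited work: profiles are dictionaries, the distance is the sum of absolute differences in the common domain, and a profile joins the first group whose members are all at distance zero from it. The names `common_domain_distance` and `first_fit_packing` are illustrative.

```python
from fractions import Fraction


def common_domain_distance(p, q):
    """Sum of |p(Z) - q(Z)| over the common domain of the two profiles."""
    return sum((abs(p[z] - q[z]) for z in p.keys() & q.keys()), Fraction(0))


def first_fit_packing(profiles, d=common_domain_distance):
    """Greedily partition profile names into groups with pairwise distance 0."""
    groups = []
    for name, p in profiles.items():
        for group in groups:
            if all(d(p, profiles[other]) == 0 for other in group):
                group.append(name)      # first group that fits
                break
        else:
            groups.append([name])       # no group fits: open a new one
    return groups


# Simple specializing change profiles from Example 6.8.
singletons = {
    "A": {"B": Fraction(1, 3), "C": Fraction(2, 3)},
    "B": {"A": Fraction(1, 2), "C": Fraction(1)},
    "C": {"A": Fraction(2, 3), "B": Fraction(2, 3)},
}
mixed = {
    "B": {"A": Fraction(1, 2), "C": Fraction(1)},
    "AB": {"C": Fraction(1)},
    "AC": {"B": Fraction(1, 2)},
}
print(first_fit_packing(singletons))
print(first_fit_packing(mixed))
```

The result depends on the order in which the profiles are processed, so the number of groups found is only an upper bound on the optimum.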

The requirement that two change profiles $ch^X$ and $ch^Y$ can be
in the same group $Ch_i$ only if $d(ch^X, ch^Y) = 0$ might be too strict.
This restriction can also be relaxed by discretizing the frequencies of
the frequent itemsets or the changes in the change profiles. (Recall
that in Section 3.2 we have seen that discretizations minimizing
several different loss functions can be found efficiently.)

Instead of minimizing the number of clusters, one could minimize
the error for a fixed number of clusters. This kind of clustering is
called a $k$-clustering. The problem of finding good $k$-clusterings is
well-studied and good approximation algorithms are known if the
dissimilarity function is a metric [Das02, dlVKKR03, FG88]. The
problem of finding the $k$-clustering of change profiles that minimizes
the sum of intracluster distances can be defined as follows:

Problem 6.3 (minimum sum of distances $k$-clustering of
change profiles). Given a collection $Ch$ of change profiles, a
distance function $d : Ch \times Ch \to \mathbb{R}$ and a positive integer $k$, find the
partition of $Ch$ into $k$ groups $Ch_1, \ldots, Ch_k$ such that

$$\sum_{i=1}^{k} \sum_{ch^X, ch^Y \in Ch_i} d(ch^X, ch^Y)$$

is minimized.

Unfortunately, it turns out that a dissimilarity function that is
defined to consist of the dissimilarities between the change profiles
in their common domains cannot be a metric since it cannot satisfy
even the triangle inequality in general:

Proposition 6.1. A function $d$ that measures the distance
between the change profiles $ch^X$ and $ch^Y$ in their common domain
$Dom(ch^X) \cap Dom(ch^Y)$ is not a metric.

Proof. Let $ch^X$, $ch^Y$ and $ch^Z$ be three change profiles such that

$$Dom(ch^X) \cap Dom(ch^Y) = \emptyset = Dom(ch^Y) \cap Dom(ch^Z)$$

but $d(ch^X, ch^Z) > 0$ (and thus $Dom(ch^X) \cap Dom(ch^Z) \neq \emptyset$). The
distances between these change profiles do not satisfy the triangle
inequality since

$$d(ch^X, ch^Z) > 0 = d(ch^X, ch^Y) + d(ch^Y, ch^Z).$$

Thus, such a distance cannot be a metric. □
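
The counterexample in the proof is easy to instantiate. The sketch below uses three made-up profiles (the item names `a` and `b` and the values are hypothetical, chosen only to satisfy the conditions of the proof) together with the sum of absolute differences in the common domain.

```python
def common_domain_distance(p, q):
    """Sum of |p(Z) - q(Z)| over the common domain of the two profiles."""
    return sum(abs(p[z] - q[z]) for z in p.keys() & q.keys())


ch_x = {"a": 1.0}   # shares no itemset with ch_y
ch_y = {"b": 1.0}   # shares no itemset with ch_x or ch_z
ch_z = {"a": 2.0}   # shares "a" with ch_x, with a different value

d_xz = common_domain_distance(ch_x, ch_z)
d_xy = common_domain_distance(ch_x, ch_y)
d_yz = common_domain_distance(ch_y, ch_z)
print(d_xz, d_xy + d_yz)
```

The direct distance between the first and the third profile is positive, while the detour through the second profile costs nothing, so the triangle inequality fails.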

It turns out that the minimum $k$-clustering of specializing change
profiles is even worse than the change profile packing problem in
the sense of approximability, as combining Theorem 6.1 and
Proposition 6.1 we get:

Theorem 6.2. The minimum sum of distances $k$-clustering
(Problem 6.3) of specializing change profiles cannot be approximated
within any ratio.

Proof. If we could approximate the $k$-clustering of specializing change
profiles, then we could, by Theorem 6.1 and Proposition 6.1, solve
the minimum graph coloring problem exactly. Namely, if a graph is
$k$-colorable, then the corresponding change profiles have a $k$-clustering
with the sum of intracluster distances being zero. Thus, an
approximation algorithm with any approximation guarantees would find
a solution with error zero if and only if the corresponding graph is
$k$-colorable. □

A major goal in the clustering of the change profiles is to further
understand the relationships between the frequent itemsets
(and collections of interesting patterns in general). As the nature
of pattern discovery is exploratory, defining a maximum number of
clusters or a maximum dissimilarity threshold might be difficult and
unnecessary. Fixing these parameters in advance can be avoided by
searching for a hierarchical clustering instead [HTF01].

A hierarchical clustering of $Ch$ is a recursive partition of the
elements to $1, 2, \ldots, |Ch|$ clusters. It is most fortunate from the
exploratory data analysis point of view that in the case of hierarchical
clustering, the clusterings of all cardinalities can be visualized at the
same time by a tree (often called a dendrogram).

There are two main types of hierarchical clustering: agglomerative
and divisive (see also Example …). The first begins with
$|Ch|$ singleton clusters and recursively merges them and the latter
recursively partitions the set $Ch$. Both are optimal in a certain sense:
each agglomerative (divisive) hierarchical clustering of $Ch$ into $k$
groups is optimal with respect to the clustering into $k + 1$ groups
($k - 1$ groups) determined by the same agglomerative (divisive)
hierarchical clustering.

Example 6.10 (a hierarchical clustering of change profiles).
Let us consider subsets of the simple change profiles of Example 6.8.
As the distance function between the change profiles, we use the
sum of absolute distances in the common domain. (For brevity, we
write the simple change profiles as 3-tuples. The positions denote
the changes with respect to A, B and C, respectively, $*$ denoting
an undefined value.)

First, let us consider the simple specializing change profiles

$$sch_s^{A} = (*, 1/3, 2/3),$$
$$sch_s^{B} = (1/2, *, 1) \text{ and}$$
$$sch_s^{C} = (2/3, 2/3, *).$$

The sums of the absolute differences in their common domains are

$$d(sch_s^{A}, sch_s^{B}) = |sch_s^{A}(C) - sch_s^{B}(C)| = |2/3 - 1| = 1/3,$$
$$d(sch_s^{A}, sch_s^{C}) = |sch_s^{A}(B) - sch_s^{C}(B)| = |1/3 - 2/3| = 1/3 \text{ and}$$
$$d(sch_s^{B}, sch_s^{C}) = |sch_s^{B}(A) - sch_s^{C}(A)| = |1/2 - 2/3| = 1/6.$$

Agglomerative and divisive hierarchical clusterings both suggest
that the clustering into two groups is $\{sch_s^{A}\}$ and $\{sch_s^{B}, sch_s^{C}\}$
with the sums of distances 0 and 1/6, respectively.

Second, let us consider the simple generalizing change profiles

$$sch_g^{AB} = (2, 3, *),$$
$$sch_g^{AC} = (3/2, *, 3/2) \text{ and}$$
$$sch_g^{BC} = (*, 3/2, 1).$$

The sums of the absolute differences in their common domains are

$$d(sch_g^{AB}, sch_g^{AC}) = |sch_g^{AB}(A) - sch_g^{AC}(A)| = |2 - 3/2| = 1/2,$$
$$d(sch_g^{AB}, sch_g^{BC}) = |sch_g^{AB}(B) - sch_g^{BC}(B)| = |3 - 3/2| = 3/2 \text{ and}$$
$$d(sch_g^{AC}, sch_g^{BC}) = |sch_g^{AC}(C) - sch_g^{BC}(C)| = |3/2 - 1| = 1/2.$$

This time there are two equally good clusterings into two groups: the
only requirement is that $sch_g^{AB}$ and $sch_g^{BC}$ are in different clusters.
The sums of the distances for the singleton cluster and the cluster
of two change profiles are 0 and 1/2, respectively.

The dendrogram visualizations of the hierarchical clusterings are
shown in Figure 6.1. □
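
The agglomerative clustering of the three simple specializing change profiles can be reproduced with a few lines of code. The sketch below assumes average linkage as the merging criterion (the criterion named in Example 6.11) and a naive search over all cluster pairs; the names `common_domain_distance` and `average_linkage` are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def common_domain_distance(p, q):
    """Sum of |p(Z) - q(Z)| over the common domain of the two profiles."""
    return sum((abs(p[z] - q[z]) for z in p.keys() & q.keys()), Fraction(0))


def average_linkage(profiles, d=common_domain_distance):
    """Naive agglomerative clustering; returns the merges in order."""
    def linkage(c1, c2):
        total = sum(d(profiles[a], profiles[b]) for a in c1 for b in c2)
        return Fraction(total) / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


profiles = {
    "A": {"B": Fraction(1, 3), "C": Fraction(2, 3)},
    "B": {"A": Fraction(1, 2), "C": Fraction(1)},
    "C": {"A": Fraction(2, 3), "B": Fraction(2, 3)},
}
for c1, c2, height in average_linkage(profiles):
    print(c1, c2, height)
```

The first merge joins the profiles of B and C at distance 1/6, which agrees with the clustering into two groups given in the example.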



Figure 6.1: The dendrograms of the hierarchical clusterings of the
simple specializing change profiles $sch_s^{A}$, $sch_s^{B}$ and $sch_s^{C}$, and the
simple generalizing change profiles $sch_g^{AB}$, $sch_g^{AC}$ and $sch_g^{BC}$,
respectively. The y-axis corresponds to the sum of absolute errors.



The divisive strategy seems to be more suitable for clustering the
change profiles since the dissimilarity functions we consider are
defined to be distances between the change profiles in their (pairwise)
common domains: The agglomerative clustering first puts the change
profiles with disjoint domains into the clusters more or less
arbitrarily. The choices made in the first few merges can cause major
differences in the clusterings into smaller numbers of clusters,
although the groups of change profiles with disjoint domains are
probably quite unimportant for determining the complete hierarchical
clustering. Contrary to the agglomerative clustering, the
divisive clustering concentrates first on the nonzero distances and
thus the change profiles with disjoint domains do not bias the whole
hierarchical clustering.

Example 6.11 (hierarchical clustering of the simple specializing
change profiles of the 34 most frequent courses in
the course completion database). To illustrate the hierarchical
clustering of change profiles, let us consider the simple specializing 
change profiles of the 34 most frequent items (i.e., the courses shown 
in Table 1221) in the collection consisting of all 1- and 2-subsets of 
the 34 most frequent items in the course completion database (see 
Subsection IT^ . 

The agglomerative clustering of the simple change profiles using
the average distances between the courses as the merging criterion
(i.e., the average linkage hierarchical clustering) is shown in
Figure 6.2.

The clustering of the specializing change profiles captures many
important dependencies between the courses. For example, the
courses 8 (English Oral Test) and 11 (Oral and Written Skills in
Swedish) are close to each other. Also, the courses 16 (Approbatur
in Mathematics I) and 32 (Approbatur in Mathematics II) are in
the same branch although their rankings with respect to their
frequencies differ considerably. Furthermore, the courses 18 (Discrete
Mathematics I) and 27 (Logic I) are close to each other as their
contents overlap considerably and they form two thirds of an alternative
for the courses 16 and 32 to obtain Approbatur in Mathematics.

The courses 14 (Scientific Writing) and 20 (Maturity Test in
Finnish) are naturally close to each other since it is very customary
to take the maturity test at the end of the Scientific Writing course.
Also the course 31 (Software Engineering Project) is close to the
course 14. The explanation for this is that both courses have almost
the same prerequisites and both are needed for the Bachelor of
Science degree with Computer Science as the major subject.

Figure 6.2: The hierarchical clustering of the 34 most frequent items
based on their simple specializing change profiles.

The courses 5 (Information Systems) and 10 (Programming in
Pascal) are deprecated and they have been replaced in the current
curriculum by the courses 21 (Introduction to Application Design),
23 (Introduction to Databases), 24 (Introduction to Programming) and
22 (Programming in Java). Similarly, the course 12 (Information
Systems Project) has been replaced by the course 33 (Database
Application Project). The courses close to the course 12, namely
the courses 13 (Concurrent Systems), 15 (Databases Systems I) and
30 (Data Communications), are also deprecated versions although
there are courses with the same names in the current curriculum.

As the simplest comparison, the clustering of items based on
the absolute differences between their frequencies is shown in
Figure 6.3. However, the clustering based on frequencies does not
capture much of the relationships between the courses. This is not
very surprising since the frequencies of the courses contain quite
little information about the courses.

A more realistic comparison would be the average linkage
hierarchical clustering based on the Hamming distances between the
items. The Hamming distance between two items in a transaction
database is the number of transactions in the database containing
one of the items but not both of them, i.e., the Hamming distance
between items $A$ and $B$ in a transaction database $\mathcal{D}$ is

$$d_H(A, B, \mathcal{D}) = |cover(A, \mathcal{D}) \setminus cover(B, \mathcal{D})| + |cover(B, \mathcal{D}) \setminus cover(A, \mathcal{D})|.$$

Such a clustering is shown in Figure 6.4.
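
The Hamming distance between items is straightforward to compute from the covers. The sketch below uses the small database of Table 6.1 rather than the course completion data, and stores the database as a dictionary from transaction identifiers to itemsets; `cover` and `hamming_distance` are illustrative names.

```python
# Transaction database of Table 6.1: transaction identifier -> itemset.
DB = {1: {"A"}, 2: {"A", "C"}, 3: {"A", "B", "C"}, 4: {"B", "C"}}


def cover(item, db):
    """Identifiers of the transactions that contain the item."""
    return {tid for tid, itemset in db.items() if item in itemset}


def hamming_distance(a, b, db):
    """Number of transactions containing exactly one of the items a and b."""
    cover_a, cover_b = cover(a, db), cover(b, db)
    return len(cover_a - cover_b) + len(cover_b - cover_a)


for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, hamming_distance(a, b, DB))
```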

The results obtained using Hamming distance are quite similar
to the results obtained using the change profiles. There are slight
differences, however. For example, the courses 0, 1 and 28 that are
close to each other in Figure 6.2 are quite far from each other
in Figure 6.4. The courses 17 and 31 are close to each other in
Figure 6.4 whereas the course 31 is in the same cluster with the
courses 14 and 20 in Figure 6.2.

The courses 18, 26 and 27 form a cluster in Figure in31 forming 
an alternative Approbatur in Mathematics but in Figure 6.2 the
course 26 is together with the course 19 which is mathematically
demanding for many students. (In Figure 6.4 the course 19 is in
the same cluster with the course 33.)

Figure 6.3: The hierarchical clustering of the 34 most frequent items
based on the absolute differences between their frequencies.

Figure 6.4: The hierarchical clustering of the 34 most frequent items
based on their Hamming distances.

In general, the hierarchical clustering with respect to Hamming
distances seems to capture courses forming entities (for example,
pairs of courses that earlier formed one course), whereas the
hierarchical clustering of change profiles seems to be related more
closely to the essence of the courses in a broader way. This is in line
with the fact that the Hamming distances between the items compare
the co-occurrences of the items directly, whereas the distances
between the change profiles measure the similarity of the behavior of
the items with respect to their whole neighborhoods except each
other. Note that the change profiles do not depend on the actual
frequencies of the items but the Hamming distances are strongly
affected by the frequencies.

The Hamming distance is symmetric with respect to whether
the item is contained in the transaction. As the transaction
databases often correspond to sparse binary matrices, this assumption
about the symmetry of presence and absence is not always justified.
The similarity between two items could be measured by the number
of transactions containing them both instead of counting the
number of transactions containing either both or neither of them.
This is also equal to the scalar product between the binary vectors
representing the covers of the items. To transform similarity to
dissimilarity, we subtract the similarity value from the cardinality of
the database. Thus, the dissimilarity is

$$|\mathcal{D}| - |cover(A, \mathcal{D}) \cap cover(B, \mathcal{D})| = supp(\emptyset, \mathcal{D}) - supp(AB, \mathcal{D}).$$

The hierarchical clustering for this dissimilarity is shown in
Figure 6.5. The results are unfortunately similar to the clustering
based on the frequencies (Figure 6.3) although also some related
courses, such as the courses 16 and 32, are close to each other in
the dendrogram regardless of their dissimilar frequencies.

Figure 6.5: The hierarchical clustering of the 34 most frequent items
based on their scalar products.

Figure 6.6: The hierarchical clustering of the 34 most frequent items
based on their cosine distances.

The change profiles used in the clustering in Figure 6.2 can be
computed from the frequencies of the 34 most frequent items
(Figure 6.3) and the frequencies of the 2-itemsets formed from the 34
most frequent items (Figure [?]). Thus, in this particular case,
the specializing simple change profiles can be considered as
normalizations of the frequencies of the 2-itemsets by the frequencies of
the items. Another approach to normalize the frequencies of the
2-itemsets by the frequencies of the items is as follows. The supports
of the 2-itemsets can be considered the scalar products between
the items, that is, between the binary vectors representing the covers
of the items. By normalizing the scalar product by the Euclidean
lengths of the vectors corresponding to the covers of the items, we
get the cosine of the angle between the vectors. The cosine of the
angle between two (non-zero) binary vectors is always in the
interval [0, 1]. Thus, the cosine distance between two items $A$ and $B$ in
a transaction database $\mathcal{D}$ is

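\[
\begin{aligned}
d_{\cos}(A,B,\mathcal{D})
 &= 1 - \frac{|\mathit{cover}(A,\mathcal{D}) \cap \mathit{cover}(B,\mathcal{D})|}{\sqrt{|\mathit{cover}(A,\mathcal{D})|}\,\sqrt{|\mathit{cover}(B,\mathcal{D})|}} \\
 &= 1 - \frac{\mathit{supp}(AB,\mathcal{D})}{\sqrt{\mathit{supp}(A,\mathcal{D})\,\mathit{supp}(B,\mathcal{D})}} \\
 &= 1 - \frac{\mathit{supp}(AB,\mathcal{D})/\mathit{supp}(\emptyset,\mathcal{D})}{\sqrt{\mathit{supp}(A,\mathcal{D})}\,\sqrt{\mathit{supp}(B,\mathcal{D})}/\mathit{supp}(\emptyset,\mathcal{D})} \\
 &= 1 - \frac{\mathit{fr}(AB,\mathcal{D})}{\sqrt{\mathit{fr}(A,\mathcal{D})\,\mathit{fr}(B,\mathcal{D})}}.
\end{aligned}
\]

The hierarchical clustering of the 34 most frequent items based on
cosine distances is shown in Figure 6.6. The clustering is very close
to the one shown in Figure 6.4. The main difference between these
two clusterings is that in the clustering shown in Figure 6.4 the
courses 0 and 1 are very different from everything (including each
other), whereas the clustering shown in Figure 6.6 grasps the
similarity between the courses 0, 1, 8 and 11. □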
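The two item dissimilarities discussed in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the database is a list of (tid, itemset) pairs, and the names `cover`, `scalar_product_dissimilarity`, `cosine_distance` and `toy_db` are hypothetical helpers, not part of the experiments reported here.

```python
from math import sqrt

def cover(item, db):
    """Set of transaction identifiers whose itemset contains `item`."""
    return {tid for tid, itemset in db if item in itemset}

def scalar_product_dissimilarity(a, b, db):
    """|D| - |cover(A) & cover(B)|, i.e. supp(empty set) - supp(AB)."""
    return len(db) - len(cover(a, db) & cover(b, db))

def cosine_distance(a, b, db):
    """1 - fr(AB) / sqrt(fr(A) fr(B)); assumes both items occur in db."""
    ca, cb = cover(a, db), cover(b, db)
    return 1.0 - len(ca & cb) / sqrt(len(ca) * len(cb))

# Hypothetical toy database, used only to exercise the functions.
toy_db = [(1, {"A", "B"}), (2, {"A", "B"}), (3, {"A"}), (4, {"B", "C"})]
```

Note that the cosine distance depends only on the three supports involved, whereas the scalar product dissimilarity also depends on the size of the database.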



6.3 Estimating Frequencies from Change 
Profiles 

The change profiles can be used as a basis for condensed representa- 
tions of frequent itemsets. Furthermore, several known condensed 
representations can be adapted to change profiles. One interesting 
approach to condense the change profiles (and thus the underlying 
pattern collections, too) is to choose a small set of representative 
change profiles (using, e.g., hierarchical clustering) and replace the 
original change profiles by the chosen representatives. Then the 
frequencies of the frequent itemsets can be estimated from the ap- 
proximate change profiles. 

Representing the frequencies of the frequent itemsets by
approximate change profiles can be seen as a condensed representation of
the collection of frequent itemsets, as the approximate change
profiles can (potentially) fit into smaller space than the exact change
profiles or even the frequent itemsets. Also, the condensed
representations can be applied to further condense the approximate
change profiles.

In addition to the fact that the frequencies can be estimated from 
the approximate change profiles, the change profiles themselves can 
benefit from the frequency estimation. Namely, the quality of the 
approximate change profiles can be assessed by evaluating how well 
the frequencies can be approximated from them. 

For the rest of the section we consider only the case where no 
change profile is missing but the changes are not exact. The meth- 
ods described in this section can be generalized to handle missing 
change profiles and missing changes. 

Given the approximations of the change profiles for the collection
$\mathcal{F}(\sigma,\mathcal{D})$ of the $\sigma$-frequent itemsets in $\mathcal{D}$, it is possible to estimate the
frequencies of the itemsets in $\mathcal{F}(\sigma,\mathcal{D})$ from the approximate change
profiles. The estimation can be done in many ways and the quality
of each estimation method depends on how the approximations of
the change profiles are obtained.

Next we describe an approach based on the estimates given by
different paths (in the graph determined by the changes of the
change profiles) from the empty itemset $\emptyset$ to the itemset $X$ whose
frequency is being estimated. In particular, we concentrate on
computing the average of the frequencies given by the paths from $\emptyset$ to $X$.
The methods are described using simple specializing change
profiles, but their generalization to other kinds of change profiles is
straightforward.

Without loss of generality, let $X = \{1, \ldots, k\}$. In principle, we
could compute the frequency estimate $\hat{\mathit{fr}}(X)$ of the itemset $X$ in
$\mathcal{F}(\sigma,\mathcal{D})$ as the average of the frequencies suggested by the paths from
$\emptyset$ to $X$. Let $\Pi_k$ be the collection of all permutations of $\{1, \ldots, k\}$.
Then the frequency estimate can be written as

\[
\hat{\mathit{fr}}(X) = \frac{1}{|\Pi_k|} \sum_{\pi \in \Pi_k} \mathit{sch}^{\emptyset}(\pi(1)) \prod_{i=2}^{k} \mathit{sch}^{\{\pi(1),\ldots,\pi(i-1)\}}(\pi(i)). \tag{6.1}
\]

The main practical difficulty of this formula is the number of
paths: the number of paths from $\emptyset$ to $X$ is equal to the number of
permutations of the items in $X$, i.e., $|X|!$. This can be
superpolynomial in $|\mathcal{F}(\sigma,\mathcal{D})|$.

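Example 6.12 (the number of paths given by simple change
profiles is superpolynomial). Let $\mathcal{F}(\sigma,\mathcal{D})$ consist of an itemset
$X$ and all of its subitemsets. Then

\[ |\mathcal{F}(\sigma,\mathcal{D})| = 2^{|X|} \]

and

\[ |X|! = \sqrt{2\pi|X|} \left(\frac{|X|}{e}\right)^{|X|} \left(1 + \Theta(|X|^{-1})\right). \]

Hence,

\[ \frac{|X|!}{|\mathcal{F}(\sigma,\mathcal{D})|} = \sqrt{2\pi|X|} \left(\frac{|X|}{2e}\right)^{|X|} \left(1 + \Theta(|X|^{-1})\right), \]

which is clearly exponential in $|X|$. □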

The frequency estimate $\hat{\mathit{fr}}(X)$ of $X$ as the average over all paths
from $\emptyset$ to $X$ can be computed much faster by observing that the
frequency estimate of $X$ is the average of the frequency estimates of the
itemsets $X \setminus \{A\}$, $A \in X$, scaled by the changes
$\mathit{sch}^{X \setminus \{A\}}(\{A\})$, i.e.,

\[ \hat{\mathit{fr}}(X) = \frac{1}{|X|} \sum_{Y \subset X,\ |Y| = |X|-1} \hat{\mathit{fr}}(Y)\, \mathit{sch}^{Y}(X \setminus Y). \]

This observation readily gives a dynamic programming solution
described as Algorithm 6.2.

As the frequency estimate has to be computed also for all subsets
of $X$ and the frequency estimate of $Y$ can be computed from the
frequency estimates of the subsets of $Y$ in time $O(|Y|)$, the time
complexity of Algorithm 6.2 is

\[ O(|X| 2^{|X|}) = O(|\mathcal{F}(\sigma,\mathcal{D})| \log |\mathcal{F}(\sigma,\mathcal{D})|). \]

Even this can be too much for a restive data analyst. The
estimation can be further sped up by sampling uniformly from the
paths from $\emptyset$ to $X$ as described by Algorithm 6.3.

The time complexity of Algorithm 6.3 is $O(k|X|)$ where $k$ is
the number of randomly chosen paths in the estimate. Note that
the algorithm can be easily modified to be an any-time algorithm.

Algorithm 6.2 A dynamic programming solution for frequency
estimation from (inexact) change profiles.

Input: An itemset $X$ and the simple specializing change profiles
(at least) for $X$ and all of its subsets.

Output: The frequency estimate $\hat{\mathit{fr}}(X)$ of $X$ as described by
Equation 6.1.

function DP-from-SCHS($X$, $\mathit{sch}$s)
    $\hat{\mathit{fr}}(\emptyset) \leftarrow 1$
    for $i = 1, \ldots, |X|$ do
        for all $Y \subseteq X$, $|Y| = i$ do
            $\hat{\mathit{fr}}(Y) \leftarrow 0$
            for all $Z \subset Y$, $|Z| = |Y| - 1$ do
                $\hat{\mathit{fr}}(Y) \leftarrow \hat{\mathit{fr}}(Y) + \hat{\mathit{fr}}(Z)\, \mathit{sch}^{Z}(Y \setminus Z)$
            end for
            $\hat{\mathit{fr}}(Y) \leftarrow \hat{\mathit{fr}}(Y)/|Y|$
        end for
    end for
end function
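The dynamic programming idea of Algorithm 6.2 can be sketched in Python as follows. The sketch assumes that the (possibly inexact) changes are available through a callable `sch(base, item)` returning an approximation of fr(base ∪ {item}) / fr(base); this interface, like the names `dp_from_schs`, `true_fr` and `exact_sch`, is an assumption made for illustration only.

```python
from itertools import combinations

def dp_from_schs(x, sch):
    """Average-over-all-paths frequency estimate of the itemset x.

    `sch(base, item)` is assumed to return the change from the frozenset
    `base` to `base | {item}` in the simple specializing change profiles.
    """
    items = sorted(x)
    fr = {frozenset(): 1.0}                  # estimate for the empty itemset
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            y = frozenset(combo)
            total = 0.0
            for a in y:                      # all subsets z of y with |z| = |y| - 1
                z = y - {a}
                total += fr[z] * sch(z, a)
            fr[y] = total / len(y)
    return fr[frozenset(items)]

# Hypothetical exact frequencies for all subsets of {"A", "B"}.
true_fr = {frozenset(): 1.0, frozenset("A"): 0.5,
           frozenset("B"): 0.4, frozenset("AB"): 0.3}

def exact_sch(base, item):
    """Exact changes derived from `true_fr` (no noise)."""
    return true_fr[base | {item}] / true_fr[base]
```

With exact changes every path gives the correct frequency, so the average does too; the estimate differs from the true frequency only when the changes are approximate.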



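Algorithm 6.3 A randomized algorithm for frequency estimation
from (inexact) change profiles.

Input: An itemset $X$, the simple specializing change profiles (at
least) for $X$ and all of its subsets, and a positive integer $k$.

Output: An estimate of the frequency estimate $\hat{\mathit{fr}}(X)$ of $X$ as
described by Equation 6.1.

function Sample-from-SCHS($X$, $\mathit{sch}$s, $k$)
    $\hat{\mathit{fr}}(\emptyset) \leftarrow 1$
    $\hat{\mathit{fr}}(X) \leftarrow 0$
    for $j = 1, \ldots, k$ do
        $Y \leftarrow \emptyset$
        for $i = 1, \ldots, |X| - 1$ do
            $A \leftarrow$ Random-Element($X \setminus Y$)
            $\hat{\mathit{fr}}(Y \cup \{A\}) \leftarrow \hat{\mathit{fr}}(Y)\, \mathit{sch}^{Y}(\{A\})$
            $Y \leftarrow Y \cup \{A\}$
        end for
        $\hat{\mathit{fr}}(X) \leftarrow \hat{\mathit{fr}}(X) + \hat{\mathit{fr}}(Y)\, \mathit{sch}^{Y}(X \setminus Y)$
    end for
    $\hat{\mathit{fr}}(X) \leftarrow \hat{\mathit{fr}}(X)/k$
end function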
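A minimal Python sketch of the path sampling estimate of Algorithm 6.3 is given below. As before, the callable `sch(base, item)` is an assumed interface to the approximate changes, and `sample_from_schs` and `half` are hypothetical names. A uniformly random path is drawn by shuffling the items of the itemset.

```python
import random

def sample_from_schs(x, sch, k, rng=None):
    """Average of the frequency estimates given by k random paths to x."""
    rng = rng or random.Random()
    items = list(x)
    total = 0.0
    for _ in range(k):
        order = items[:]
        rng.shuffle(order)            # uniformly random path from the empty itemset to x
        estimate, base = 1.0, frozenset()
        for a in order:
            estimate *= sch(base, a)  # follow one change of the change profile of base
            base = base | {a}
        total += estimate
    return total / k

def half(base, item):
    """Hypothetical change profile in which every change equals 0.5."""
    return 0.5
```

An any-time variant could keep the running sum and the number of paths drawn so far and report their ratio whenever it is interrupted.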
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This would sometimes be useful in interactive data mining and for 
resource bounded data mining in general. 

Algorithm 6.2 and Algorithm 6.3 can be adapted to other kinds
of estimates, too. In particular, if upper and lower bounds for the
changes $\mathit{sch}^{Y}(A)$ are given for all $Y \subset X$ such that $A \in X \setminus Y$,
then it is possible to compute the upper and lower bounds for the
frequency of $X$ for all itemsets $X$ reachable from $\emptyset$ by changes of
the change profiles. Namely, the frequency of the itemset $X$ is at
most the minimum of the upper bound estimates and at least the
maximum of the lower bound estimates determined by the change
paths from $\emptyset$ to $X$.
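The bound computation described above admits the same dynamic programming structure as Algorithm 6.2, with the average replaced by a maximum (for lower bounds) and a minimum (for upper bounds). The following Python sketch assumes two callables `sch_lower(base, item)` and `sch_upper(base, item)` giving non-negative bounds for the changes; these names and the function `frequency_bounds` are illustrative assumptions.

```python
from itertools import combinations

def frequency_bounds(x, sch_lower, sch_upper):
    """Tightest lower and upper frequency bounds over all paths to x."""
    items = sorted(x)
    lo = {frozenset(): 1.0}
    hi = {frozenset(): 1.0}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            y = frozenset(combo)
            # Every path to y ends with some item a added to y - {a}.
            lo[y] = max(lo[y - {a}] * sch_lower(y - {a}, a) for a in y)
            hi[y] = min(hi[y - {a}] * sch_upper(y - {a}, a) for a in y)
    top = frozenset(items)
    return lo[top], hi[top]
```

Because all changes are non-negative, taking the maximum (or minimum) level by level gives the same result as enumerating all paths explicitly.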



6.4 Condensation by Change Profiles 

The usefulness of approximate change profiles, the stability of the
frequency estimation algorithms proposed in Section 6.3 and the
accuracy of the path sampling estimates were evaluated by estimating
frequencies from noisified simple specializing change profiles in the
transaction databases Internet Usage and IPUMS Census (see
Subsection 2.2.1).

In order to study how the estimation methods (i.e., Algorithm 6.2
and Algorithm 6.3) tolerate different kinds of noise, the simple
specializing change profiles were noisified in three different ways:

• randomly perturbing the changes of the change profiles by $\pm\epsilon$,

• adding uniform noise from the interval $[-\epsilon, \epsilon]$ to the changes
of the change profiles, and

• adding Gaussian noise with zero mean and standard deviation
$\epsilon$ to the changes of the change profiles.

The changes of the noisified change profiles were truncated to the
interval [0, 1] since, by the definition of specializing change profiles
(Definition 6.1), the changes in the specializing change profiles must
be in the interval [0, 1].
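The three noise models and the truncation step can be sketched as follows. This is an illustration only: the function name `noisify` and the string labels for the noise kinds are assumptions, and the code is not the implementation used in the experiments.

```python
import random

def noisify(change, eps, kind, rng):
    """Return a noisy version of one change, truncated to [0, 1]."""
    if kind == "perturb":
        noisy = change + rng.choice((-eps, eps))   # change +/- eps
    elif kind == "uniform":
        noisy = change + rng.uniform(-eps, eps)    # uniform noise on [-eps, eps]
    elif kind == "gaussian":
        noisy = change + rng.gauss(0.0, eps)       # zero mean, standard deviation eps
    else:
        raise ValueError("unknown noise kind: " + kind)
    return min(1.0, max(0.0, noisy))               # truncate to [0, 1]
```

Passing an explicit `random.Random` instance as `rng` makes repeated experiments reproducible.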

We tested the dependency of the approximation on the number
of sample paths by evaluating the absolute difference between the
correct and the estimated frequencies for the dynamic programming
solution corresponding to the average frequency estimate over all
paths, and the sample solution corresponding to the average
frequency estimate over the sampled paths. The experiments were
repeated with different numbers of randomly chosen paths, minimum
frequency thresholds and noise levels $\epsilon$.

The results with noise level $\epsilon = 0.01$ are shown in Figures 6.7,
6.9 and 6.11 for the Internet Usage data with minimum frequency
threshold 0.20, and in Figures 6.8, 6.10 and 6.12 for the IPUMS
Census data with minimum frequency threshold 0.30. Each of the
curves is an average of 1000 random experiments. The results were
similar with the other minimum frequency thresholds, too.

The results show that already a quite small number of random
paths suffices to give frequency approximations close to the
dynamic programming solution. Furthermore, the average absolute
errors achieved by dynamic programming were relatively small,
especially as the errors in the changes cumulate multiplicatively when
the frequencies are estimated along the paths.
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Figure 6.7: Internet Usage data, Gaussian noise. 
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Figure 6.8: IPUMS Census data, Gaussian noise. 
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Figure 6.9: Internet Usage data, perturbation. 
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Figure 6.10: IPUMS Census data, perturbation. 
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Figure 6.11: Internet Usage data, uniform noise. 
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Figure 6.12: IPUMS Census data, uniform noise. 



CHAPTER 7 



Inverse Pattern Discovery 



The problem of discovering interesting patterns from data has been
studied very actively in data mining. (See, e.g., Chapter 2 for more
details.) For such a well-studied problem, it is natural to study
also the inverse version of the problem, i.e., the problem of
inverse pattern discovery. That is, to study the problem of finding a
database compatible with a given collection of interesting patterns.

In addition to being an important class of tasks for data mining
as a scientific discipline, inverse pattern discovery might also have
some practical relevance.

First, the existence of databases compatible with the given
collection of patterns is usually highly desirable since the collection of
interesting patterns is often assumed to be a summary of some
database. Deciding whether there exists a database compatible with
the given collection of interesting patterns (and their quality
values) can be considered as a very harsh quality control. An efficient
method for answering that question could also have some
practical implications for pattern discovery, since several parties are
not willing to share their data but might sell some patterns,
claiming that they are interesting in their database. (If interaction with the
pattern provider were allowed, then also, e.g., zero-knowledge
proofs deciding whether they are interesting or not could be
considered [Gol02].)

Second, if the number of compatible databases could be counted,
then the pattern provider could evaluate how well the pattern user
could detect the correct database from the patterns: without any
background information, a randomly chosen database compatible
with the patterns would be the original one with probability $1/k$
where $k$ is the number of databases compatible with the patterns.
If there is more background information, however, then the
probability of finding the original database can sometimes be made higher,
but still the number of compatible databases is likely to tell about
the difficulty of finding the original database based on the patterns.
Furthermore, the number of compatible databases can be used as
a measure of how well the pattern collection characterizes the
database.

Third, the pattern collection can be considered as a collection 
of queries that should be answered correctly. Thus, the patterns 
can be used to optimize the database with respect to, e.g., query 
efficiency or space consumption. The optimization task could be 
expressed as follows: given the pattern collection, find the smallest 
database that gives the correct quality values for the patterns. 

In this chapter, the computational complexity of inverse pat- 
tern discovery is studied in the special case of frequent itemsets. 
Deciding whether there is a database compatible with the frequent 
itemsets and their frequencies is shown to be NP-complete although 
some special cases of the problem can be solved in polynomial time. 
Furthermore, finding the smallest compatible transaction database 
for an itemset collection consisting only of two disjoint maximal 
itemsets and all their subitemsets is shown to be NP-hard. 

Obviously, inverse frequent itemset mining is just one example of
inverse pattern discovery. For example, let us assume that the data
is a $d$-dimensional matrix (i.e., a $d$-dimensional data cube [GBLP96])
and the summary of the data consists of the sums over each
coordinate (i.e., all $(d-1)$-dimensional sub-cubes). The computational
complexity of the inverse pattern discovery for such data and patterns,
i.e., the computational complexity of the problem of
reconstructing a multidimensional table compatible with the sums, has been
studied in the field of discrete tomography [HK99]: the problem is
solvable in polynomial time if the matrix is a two-dimensional binary
matrix |Kuh89j and NP-hard otherwise |(;i)ni[inTTVWnnirrT94l| . In 
this chapter, however, we shall focus on inverting frequent itemset 
mining. 

This chapter is based on the article "On Inverse Frequent Set
Mining" [Mie03e]. Some similar results were independently shown
by Toon Calders [Cal04a]. Recently, a heuristic method for the
inverse frequent itemset mining problem has been proposed [WWWL05].



7.1 Inverting Frequent Itemset Mining 

The problem of deducing a transaction database compatible with a
given collection of frequent itemsets and their supports can be
formulated as follows.

Problem 7.1 (inverse frequent itemset mining). Given a
downward closed collection $\mathcal{F}$ of itemsets and the support $\mathit{supp}(X,\mathcal{F})$ for
each itemset $X \in \mathcal{F}$, find a transaction database $\mathcal{D}$ compatible with
the supports of the collection, i.e., a transaction database $\mathcal{D}$ such
that $\mathit{supp}(X,\mathcal{D}) = \mathit{supp}(X,\mathcal{F})$ for all $X \in \mathcal{F}$.

Example 7.1 (inverse frequent itemset mining). The
collection $\mathcal{F} = \{\emptyset, A, B, C, AB, BC\}$ with supports

$\mathit{supp}(\emptyset,\mathcal{F}) = 6$,
$\mathit{supp}(A,\mathcal{F}) = 4$,
$\mathit{supp}(B,\mathcal{F}) = 4$,
$\mathit{supp}(C,\mathcal{F}) = 4$,
$\mathit{supp}(AB,\mathcal{F}) = 3$ and
$\mathit{supp}(BC,\mathcal{F}) = 3$

restricts the collection of transaction databases compatible with these
constraints. For example, the following constraints can be deduced
from the support constraints:

• The support of the empty itemset tells that the number of
transactions in any compatible database is six.

• There is exactly one transaction with $A$ and without $B$
($\mathit{supp}(A,\mathcal{F}) - \mathit{supp}(AB,\mathcal{F}) = 1$), one with $B$ and without $C$
($\mathit{supp}(B,\mathcal{F}) - \mathit{supp}(BC,\mathcal{F}) = 1$), and vice versa
($\mathit{supp}(C,\mathcal{F}) - \mathit{supp}(BC,\mathcal{F}) = 1$).

• The support of $AC$ is at least two since there are at most two
transactions that contain $B$ but not both $A$ and $C$.

One transaction database compatible with these constraints is

$\mathcal{D} = \{(1, ABC), (2, ABC), (3, AB), (4, BC), (5, A), (6, C)\}$

since the supports

$\mathit{supp}(\emptyset,\mathcal{D}) = 6$,
$\mathit{supp}(A,\mathcal{D}) = 4$,
$\mathit{supp}(B,\mathcal{D}) = 4$,
$\mathit{supp}(C,\mathcal{D}) = 4$,
$\mathit{supp}(AB,\mathcal{D}) = 3$,
$\mathit{supp}(AC,\mathcal{D}) = 2$ and
$\mathit{supp}(BC,\mathcal{D}) = 3$

determined by $\mathcal{D}$ agree with the given supports.

There are also other databases compatible with the constraints.
For example,

$\mathcal{D}' = \{(1, ABC), (2, ABC), (3, ABC), (4, A), (5, B), (6, C)\}$

is one such database. □
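The compatibility of the two databases of Example 7.1 with the given supports can be checked mechanically. In the following Python sketch, itemsets are written as strings of item names; the helper names `supp` and `compatible` are chosen for illustration.

```python
def supp(itemset, db):
    """Number of transactions of db that contain every item of `itemset`."""
    return sum(1 for _tid, t in db if set(itemset) <= set(t))

def compatible(db, supports):
    """True if db gives exactly the listed support for every listed itemset."""
    return all(supp(x, db) == s for x, s in supports.items())

# Supports of Example 7.1; the empty string stands for the empty itemset.
given = {"": 6, "A": 4, "B": 4, "C": 4, "AB": 3, "BC": 3}

db1 = [(1, "ABC"), (2, "ABC"), (3, "AB"), (4, "BC"), (5, "A"), (6, "C")]
db2 = [(1, "ABC"), (2, "ABC"), (3, "ABC"), (4, "A"), (5, "B"), (6, "C")]
```

Both databases pass the check although they disagree on the support of $AC$ (two in the first database and three in the second), which illustrates that the given supports do not determine the database uniquely.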

The reason why we use supports instead of frequencies in the
inverse frequent itemset mining problem (Problem 7.1) is that
supports are slightly more informative. On one hand, the frequencies of
the frequent itemsets $X \in \mathcal{F}(\sigma,\mathcal{D})$ can be computed from their
supports since $\mathit{fr}(X,\mathcal{D}) = \mathit{supp}(X,\mathcal{D})/\mathit{supp}(\emptyset,\mathcal{D})$. On the other hand,
the supports cannot be computed from the frequencies: the number
of transactions in the database, i.e., $\mathit{supp}(\emptyset,\mathcal{D})$, is not revealed by
the frequencies of the itemsets.



7.2 Frequent Itemsets and Projections 

To determine the complexity of the inverse frequent itemset mining
problem (Problem 7.1), let us consider an intermediate
representation between the frequent itemsets and the transaction database.

Definition 7.1 (projections of transaction databases). The
projection of the transaction database $\mathcal{D}$ onto itemset $X$ is a
restriction

\[ \mathit{pr}(X,\mathcal{D}) = \{(i, X \cap Y) : (i,Y) \in \mathcal{D}\} \]

of the database $\mathcal{D}$. The collection of projections $\mathit{pr}(X,\mathcal{D})$ onto
itemsets $X \in \mathcal{F}$ is denoted by

\[ \mathit{pr}(\mathcal{F},\mathcal{D}) = \{\mathit{pr}(X,\mathcal{D}) : X \in \mathcal{F}\}. \]

Two projections $\mathit{pr}(X,\mathcal{D})$ and $\mathit{pr}(X,\mathcal{D}')$ are considered to be
equivalent if and only if $|\mathcal{D}| = |\mathcal{D}'|$ and there is a bijective mapping
$\pi$ from $\mathit{tid}(\mathcal{D})$ to $\mathit{tid}(\mathcal{D}')$ such that for each $(i,Y) \in \mathcal{D}$ there is
$(\pi(i),Y') \in \mathcal{D}'$ with $X \cap Y = X \cap Y'$. (That is, the mapping $\pi$
is a permutation since we can assume that $\mathit{tid}(\mathcal{D}) = \mathit{tid}(\mathcal{D}') =
\{1, \ldots, |\mathcal{D}|\}$; see the definition of transaction databases.)

The projections of transaction databases have many desirable
similarities to itemsets. For example, neglecting the transaction
identifiers, the projections of the database $\mathcal{D}$ onto the maximal
$\sigma$-frequent itemsets contain the same information as the $\sigma$-frequent
itemsets and their supports.

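Theorem 7.1. The frequent itemsets in $\mathcal{F}(\sigma,\mathcal{D})$ and their
supports in $\mathcal{D}$ can be computed from the projections $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$,
and projections equivalent to $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$ can be computed
from the frequent itemsets in $\mathcal{F}(\sigma,\mathcal{D})$ and their supports in $\mathcal{D}$.

Proof. For each $X \in \mathcal{F}(\sigma,\mathcal{D})$ and each $Y \supseteq X$ we have

\[
\begin{aligned}
\mathit{supp}(X,\mathcal{D}) &= |\{(i,Z) \in \mathcal{D} : X \subseteq Z\}| \\
 &= |\{(i, Y \cap Z) : (i,Z) \in \mathcal{D},\ X \subseteq (Y \cap Z)\}| \\
 &= \mathit{supp}(X, \mathit{pr}(Y,\mathcal{D})).
\end{aligned}
\]

By definition, each $\sigma$-frequent itemset $X \in \mathcal{F}(\sigma,\mathcal{D})$ is contained
in some maximal $\sigma$-frequent itemset $Y \in \mathcal{FM}(\sigma,\mathcal{D})$.
Furthermore, no $\sigma$-infrequent itemset is contained in any of the maximal
$\sigma$-frequent itemsets in $\mathcal{D}$. Thus, the collection $\mathcal{F}(\sigma,\mathcal{D})$ of the
$\sigma$-frequent itemsets and their supports in $\mathcal{D}$ can be computed from
the collection $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$ of projections of the transaction
database $\mathcal{D}$ onto the maximal $\sigma$-frequent itemsets in $\mathcal{D}$.

The projections equivalent to $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$ can be
computed from the collection $\mathcal{F}(\sigma,\mathcal{D})$ of the $\sigma$-frequent itemsets and
their supports in $\mathcal{D}$ by Algorithm 7.1. The running time of the
algorithm is polynomial in $|\mathcal{F}(\sigma,\mathcal{D})|$, $|\mathcal{I}|$ and $|\mathcal{D}| = \mathit{supp}(\emptyset,\mathcal{D})$.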

The running time can be further improved if the transaction
database $\mathcal{D}'$ has a primitive for inserting $k$ transactions consisting
of an itemset $X$ into $\mathcal{D}'$ in time polynomial in $|X|$ but not
in $k$ at all. Namely, then the running time of Algorithm 7.1 can be
expressed as a polynomial of $|\mathcal{F}(\sigma,\mathcal{D})|$ and $|\mathcal{I}|$, i.e., not depending
on the actual number of transactions in the transaction database
$\mathcal{D}$.

The efficient insertion of $k$ transactions with the itemset $X$ into
$\mathcal{D}'$ can be implemented, e.g., by "run-length encoding" the database,
i.e., by describing the transactions $(i, X), \ldots, (i + k - 1, X)$ by the
triple $(i, k, X)$. Then the insertion of $k$ transactions with the itemset
$X$ to $\mathcal{D}'$ can be implemented by inserting the tuple $(|\mathcal{D}'| + 1, k, X)$
to $\mathcal{D}'$. □

Theorem 7.1 also implies that if $\mathcal{F}(\sigma,\mathcal{D}) = 2^{\mathcal{I}}$, then the whole
transaction database $\mathcal{D}$ (although without the correct transaction
identifiers) can be reconstructed from the collection $\mathcal{F}(\sigma,\mathcal{D})$ and
their supports in $\mathcal{D}$ in time polynomial in $|\mathcal{F}(\sigma,\mathcal{D})|$ and $|\mathcal{D}|$ since the
collection $\mathcal{FM}(\sigma,\mathcal{D})$ consists only of the itemset $\mathcal{I}$. Furthermore,
the supports of frequent itemsets can determine (implicitly) also
supports of some infrequent itemsets [Cal04b, CG02].

Let us denote the projections determined by the downward closed
itemset collection $\mathcal{F}$ by $\mathit{pr}(\mathcal{M},\mathcal{F})$. The number of different itemsets
in the transactions of $\mathit{pr}(\mathcal{M},\mathcal{F})$ can be considerably smaller than
the number of itemsets in $\mathcal{F}$. Thus, each projection $\mathit{pr}(X,\mathcal{D})$ of $\mathcal{D}$ onto
$X \in \mathcal{FM}(\sigma,\mathcal{D})$ represented as a list of tuples $(\mathit{count}(Y,\mathcal{D}), Y)$, $Y \subseteq X$,
can be used as a condensed representation of the collection
$\mathcal{F}(\sigma,\mathcal{D})$ and their supports. Such projections sometimes provide
very small representations compared to $\mathcal{F}(\sigma,\mathcal{D})$ [Mie03c].

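As projections constructed from the collection $\mathcal{F}(\sigma,\mathcal{D})$ of the
$\sigma$-frequent itemsets and their supports in $\mathcal{D}$ are (at least
seemingly) closer to the original transaction database than the collection
$\mathcal{F}(\sigma,\mathcal{D})$ and their supports, the projections could be useful to make
the inverse frequent itemset mining problem more comprehensible
by an equivalent formulation of the problem.

Algorithm 7.1 An algorithm to compute projections equivalent to
$\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$ from $\mathcal{F}(\sigma,\mathcal{D})$ and their supports.

Input: The collection $\mathcal{F}(\sigma,\mathcal{D})$ of $\sigma$-frequent itemsets in a transaction
database $\mathcal{D}$ and their supports.

Output: The projections $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{F}(\sigma,\mathcal{D}))$ equivalent to
projections $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$.

function To-Projections($\mathcal{F}(\sigma,\mathcal{D})$, $\mathit{supp}$)
    $\mathcal{FM}(\sigma,\mathcal{D}) \leftarrow \{X \in \mathcal{F}(\sigma,\mathcal{D}) : Y \supsetneq X \Rightarrow Y \notin \mathcal{F}(\sigma,\mathcal{D})\}$
    for all $X \in \mathcal{FM}(\sigma,\mathcal{D})$ do
        $\mathcal{D}' \leftarrow \emptyset$
        $\mathcal{F} \leftarrow \{Y \in \mathcal{F}(\sigma,\mathcal{D}) : Y \subseteq X\}$
        $\mathcal{M} \leftarrow \{X\}$
        for all $Y \in \mathcal{F}$ do
            $\mathit{supp}(Y,\mathcal{F}) \leftarrow \mathit{supp}(Y,\mathcal{D})$
        end for
        while $\mathcal{M} \neq \emptyset$ do
            for all $Y \in \mathcal{M}$ do
                $c \leftarrow \mathit{supp}(Y,\mathcal{F})$
                $\mathcal{D}' \leftarrow \mathcal{D}' \cup \{(|\mathcal{D}'| + 1, Y), \ldots, (|\mathcal{D}'| + c, Y)\}$
                for all $Z \in \mathcal{F}$, $Z \subseteq Y$ do
                    $\mathit{supp}(Z,\mathcal{F}) \leftarrow \mathit{supp}(Z,\mathcal{F}) - c$
                    if $\mathit{supp}(Z,\mathcal{F}) = 0$ then
                        $\mathcal{F} \leftarrow \mathcal{F} \setminus \{Z\}$
                    end if
                end for
            end for
            $\mathcal{M} \leftarrow \{Y \in \mathcal{F} : Y \subset Z \Rightarrow Z \notin \mathcal{F}\}$
        end while
        $\mathit{pr}(X,\mathcal{F}(\sigma,\mathcal{D})) \leftarrow \mathit{pr}(X,\mathcal{D}')$
    end for
    return $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{F}(\sigma,\mathcal{D}))$
end function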
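The core step of Algorithm 7.1, computing the projection onto one maximal itemset from the supports of its subsets, can be sketched in Python as below. The sketch is a simplification and not a transcription of the algorithm: instead of maintaining the collection of maximal remaining itemsets it processes the subsets in order of decreasing cardinality, and it returns multiplicities of the projected itemsets rather than transactions with identifiers. The names `projection_from_supports`, `F` and `example_supports` are assumptions made for illustration.

```python
def projection_from_supports(x, supports):
    """Multiplicities of the rows of the projection onto x.

    `supports` maps frozensets to supports and is assumed to contain x and
    all of its subsets (including the empty itemset).
    """
    residual = {y: s for y, s in supports.items() if y <= x}
    counts = {}
    # Larger itemsets first: once all strict supersets of y have been
    # handled, the residual support of y is the number of rows equal to y.
    for y in sorted(residual, key=len, reverse=True):
        c = residual[y]
        if c > 0:
            counts[y] = c
            for z in residual:
                if z < y:            # strict subsets of y
                    residual[z] -= c
    return counts

F = frozenset
# Supports of Example 7.1.
example_supports = {F(): 6, F("A"): 4, F("B"): 4, F("C"): 4,
                    F("AB"): 3, F("BC"): 3}
```

For the supports of Example 7.1, the projection onto $AB$ obtained this way consists of three rows $AB$, one row $A$, one row $B$ and one empty row, in agreement with both databases given in the example.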
Problem 7.2 (database reconstruction from projections).
Given a collection $\mathit{pr}(\mathcal{M},\mathcal{F})$ of projections onto maximal itemsets,
find a transaction database $\mathcal{D}$ such that $\mathit{pr}(\mathcal{M},\mathcal{F}) = \mathit{pr}(\mathcal{M},\mathcal{D})$.

There are, however, collections of projections that cannot be
realized as downward closed itemset collections. We should be able to
ensure in time polynomial in the sum of the cardinalities of
transactions in the projections that the collection of projections can be
realized as a downward closed itemset collection with some
supports. Fortunately, there are simple conditions that are necessary
and sufficient to ensure that there is a downward closed itemset
collection compatible with a given collection of projections.

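Theorem 7.2. The projections $\mathit{pr}(X_1,\mathcal{F}_1), \ldots, \mathit{pr}(X_m,\mathcal{F}_m)$ have
a compatible collection of itemsets, i.e., a collection $\mathcal{F}$ such
that

\[ \mathit{supp}(Y, \mathit{pr}(X_i,\mathcal{F}_i)) = \mathit{supp}(Y,\mathcal{F}) \]

for all $Y \subseteq X_i$, $1 \le i \le m$, if and only if

\[ \mathit{pr}(X_i \cap X_j,\mathcal{F}_i) = \mathit{pr}(X_i \cap X_j,\mathcal{F}_j) \]

for all $1 \le i,j \le m$.

Proof. If there is a downward closed itemset collection $\mathcal{F}$ such that

\[ \mathit{supp}(Y, \mathit{pr}(X_i,\mathcal{F}_i)) = \mathit{supp}(Y,\mathcal{F}) \]

for all $Y \subseteq X_i$, $1 \le i \le m$, then

\[ \mathit{pr}(X_i \cap X_j,\mathcal{F}_i) = \mathit{pr}(X_i \cap X_j,\mathcal{F}_j) \]

for all $1 \le i,j \le m$. Otherwise $\mathit{pr}(X_i \cap X_j,\mathcal{F}_i)$ and $\mathit{pr}(X_i \cap X_j,\mathcal{F}_j)$
would determine different supports for some itemset $Y \subseteq X_i \cap X_j$
where $1 \le i,j \le m$.

If

\[ \mathit{pr}(X_i \cap X_j,\mathcal{F}_i) = \mathit{pr}(X_i \cap X_j,\mathcal{F}_j) \]

for all $1 \le i,j \le m$, then

\[ \mathit{supp}(Y, \mathit{pr}(X_i \cap X_j,\mathcal{F}_i)) = \mathit{supp}(Y, \mathit{pr}(X_i \cap X_j,\mathcal{F}_j)) \]

for all itemsets $Y \subseteq X_i \cap X_j$ where $1 \le i,j \le m$. □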
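The condition of Theorem 7.2 is easy to test in practice. The following Python sketch compares, for each pair of projections, the multisets of rows restricted to the common items, ignoring transaction identifiers as in the definition of equivalent projections. The input format (a list of pairs consisting of the items of the projection and its rows) and the names `project` and `pairwise_consistent` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def project(rows, onto):
    """Multiset of the rows restricted to the items in `onto`."""
    return Counter(frozenset(r) & onto for r in rows)

def pairwise_consistent(projections):
    """Check pr(X_i & X_j, F_i) = pr(X_i & X_j, F_j) for all pairs i, j.

    `projections` is a list of (items, rows) pairs, where `rows` is a list
    of itemsets given as iterables of item names.
    """
    for (xi, ri), (xj, rj) in combinations(projections, 2):
        common = frozenset(xi) & frozenset(xj)
        if project(ri, common) != project(rj, common):
            return False
    return True
```

For instance, the projections of the first database of Example 7.1 onto $AB$ and $BC$ both contain four rows with $B$ and two rows without it, so they pass the check.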
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The number of transactions in the transaction database V can 
be exponential in the number of frequent itemsets (and thus also in 
the sum of the cardinalities of the frequent itemsets). 

Example 7.2 (a transaction database being exponentially
larger than the frequent itemset collection). Let the itemset
collection consist of just one itemset with support exponential in
$|\mathcal{I}|$. Then the number of transactions in $\mathcal{D}$ is exponential in $|\mathcal{I}|$. □

This fact does not have to be considered as a drawback since
most of the results shown in this chapter are hardness results.
Furthermore, it is reasonable to assume that if one is trying to
reconstruct a transaction database then the number of transactions in the
database is not considered to be infeasibly large.

7.3 The Computational Complexity of the 
Problem 

In this section we show that Problem 7.2 is difficult in general but
some of its special cases can be solved in polynomial time and even
in logarithmic space. Our first hardness result shows that
Problem 7.2 is NP-hard in general. The hardness is shown by a
reduction from the graph 3-colorability problem:

Problem 7.3 (graph 3-colorability [GJ79]). Given a graph
$G = (V,E)$, decide whether there is a good 3-coloring, i.e., a
labeling $\mathit{label} : V \to \{r,g,b\}$ such that $\mathit{label}(u) \neq \mathit{label}(v)$ for all
$\{u,v\} \in E$.

Theorem 7.3. The problem of deciding whether there is a
transaction database $\mathcal{D}$ compatible with the projections $\mathit{pr}(\mathcal{M},\mathcal{F})$ (i.e.,
the decision version of Problem 7.2) is NP-complete even when the
compatible transaction databases consist of only six transactions.

Proof. The problem is clearly in NP since it can be verified in
time polynomial in the sizes of $\mathit{pr}(\mathcal{M},\mathcal{F})$ and $\mathcal{D}$ whether a certain
transaction database $\mathcal{D}$ is compatible with projections $\mathit{pr}(\mathcal{M},\mathcal{F})$
simply by computing the projections $\mathit{pr}(\mathcal{M},\mathcal{D})$.

We show the NP-hardness of Problem 7.2 by a reduction from
an instance $G = (V,E)$ of the graph 3-colorability problem
(Problem 7.3) to projections $\mathit{pr}(\mathcal{M},\mathcal{F})$ such that the projections in
$\mathit{pr}(\mathcal{M},\mathcal{F})$ are compatible with
the projections $\mathit{pr}(\mathcal{FM}(\sigma,\mathcal{D}),\mathcal{D})$ of some transaction database $\mathcal{D}$
if and only if $G$ is 3-colorable.

Let the set $\mathcal{I}$ of items be $\{r_v, g_v, b_v : v \in V\}$. The projections
are constructed as follows. For each edge $\{u,v\} \in E$ we define a
projection

\[
\begin{aligned}
\mathit{pr}(\{r_u,g_u,b_u,r_v,g_v,b_v\},\mathcal{F}) = \{ & (1,\{r_u,g_v\}), (2,\{r_u,b_v\}), (3,\{g_u,r_v\}), \\
 & (4,\{g_u,b_v\}), (5,\{b_u,r_v\}), (6,\{b_u,g_v\}) \}.
\end{aligned}
\]

If the graph $G = (V,E)$ is not 3-colorable then there is no
transaction database $\mathcal{D}$ compatible with the projections: for every
3-coloring of $G$, there is an edge $\{u,v\} \in E$ with $\mathit{label}(u) = \mathit{label}(v)$
but none of the pairs $\{r_u,r_v\}$, $\{g_u,g_v\}$, and $\{b_u,b_v\}$ appear in the
projection $\mathit{pr}(\{r_u,g_u,b_u,r_v,g_v,b_v\},\mathcal{F})$. Thus there is not even a
partial solution of one transaction compatible with the projections.

If the graph $G$ is 3-colorable then there is a transaction database
$\mathcal{D}$ that is compatible with the projections: the six transactions in
the database $\mathcal{D}$ are the six permutations of a 3-coloring $\mathit{label}$ such
that $\mathit{label}(u) \neq \mathit{label}(v)$ for all $\{u,v\} \in E$. □
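The "if" direction of the reduction can be illustrated with a small Python sketch. An item such as $r_v$ is represented by the pair `("r", v)`; this encoding and the function names `edge_projection`, `database_from_coloring` and `is_compatible` are assumptions made for the illustration. Given a coloring, the sketch builds the six transactions obtained by permuting the colors and checks them against the projection required for every edge.

```python
from collections import Counter
from itertools import permutations

COLORS = "rgb"

def edge_projection(u, v):
    """Rows required for the edge {u, v}: all pairs of distinct colors."""
    return Counter(frozenset({(c, u), (d, v)})
                   for c in COLORS for d in COLORS if c != d)

def database_from_coloring(label):
    """Six transactions, one for each permutation of the three colors."""
    return [frozenset((perm[COLORS.index(c)], v) for v, c in label.items())
            for perm in permutations(COLORS)]

def is_compatible(db, edges):
    """True if db has the required projection for every edge."""
    for u, v in edges:
        domain = {(c, w) for c in COLORS for w in (u, v)}
        if Counter(t & domain for t in db) != edge_projection(u, v):
            return False
    return True
```

For a triangle with three differently colored vertices the constructed database is compatible with all edge projections, whereas a labeling that gives the same color to both endpoints of some edge produces a row that does not appear in the projection of that edge.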

As mentioned in the beginning of the chapter, it would be
desirable to be able to estimate how many compatible databases
there exist. The proof of Theorem 7.3 can also be adapted to
give the hardness result for the counting version of Problem 7.2.
(See [Pap95] for more details on counting complexity.)



Theorem 7.4. The problem of counting the number of
transaction databases $\mathcal{D}$ compatible with the projections $\mathit{pr}(\mathcal{M},\mathcal{F})$ is
#P-complete.

Proof. The problem is in #P since its decision version is in NP.
Using the reduction described in the proof of Theorem 7.3, the
number of good 3-colorings could be counted: the number of good
3-colorings is $1/6! = 1/720$ times the number of transaction databases
compatible with the projections corresponding to the given graph
$G$. As counting the number of good 3-colorings is #P-hard [GJ79],
so is counting the number of compatible databases. □



Although the database reconstruction problem is NP-complete
in general, there are some special cases that can be solved in
polynomial time. In one of the simplest such cases the instance
consists of only two projections (with an arbitrary number of items).

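Theorem 7.5. It can be decided in polynomial time whether there
is a transaction database $\mathcal{D}$ that is compatible with given projections
$\mathit{pr}(X_1,\mathcal{F}_1)$ and $\mathit{pr}(X_2,\mathcal{F}_2)$. Furthermore, the number of compatible
transaction databases $\mathcal{D}$ can be computed in polynomial time.

Proof. By definition, the projection $\mathit{pr}(X_1,\mathcal{F}_1)$ is compatible with
a transaction database $\mathcal{D}$ if and only if $\mathit{pr}(X_1,\mathcal{F}_1) = \mathit{pr}(X_1,\mathcal{D})$,
and the projection $\mathit{pr}(X_2,\mathcal{F}_2)$ is compatible with $\mathcal{D}$ if and only if
$\mathit{pr}(X_2,\mathcal{F}_2) = \mathit{pr}(X_2,\mathcal{D})$. The database $\mathcal{D}$ is compatible with both
projections if and only if

\[ \mathit{pr}(X_1 \cap X_2,\mathcal{F}_1) = \mathit{pr}(X_1 \cap X_2,\mathcal{D}) = \mathit{pr}(X_1 \cap X_2,\mathcal{F}_2), \]

\[ \mathit{pr}(X_1 \setminus X_2,\mathcal{F}_1) = \mathit{pr}(X_1 \setminus X_2,\mathcal{D}) \]

and

\[ \mathit{pr}(X_2 \setminus X_1,\mathcal{F}_2) = \mathit{pr}(X_2 \setminus X_1,\mathcal{D}). \]

A transaction database $\mathcal{D}$ compatible with the two projections
$\mathit{pr}(X_1,\mathcal{F}_1)$ and $\mathit{pr}(X_2,\mathcal{F}_2)$ can be found by sorting the
transactions in the projections $\mathit{pr}(X_1,\mathcal{F}_1)$ and $\mathit{pr}(X_2,\mathcal{F}_2)$ with respect to
the itemsets in $\mathit{pr}(X_1 \cap X_2,\mathcal{F}_1)$ and $\mathit{pr}(X_1 \cap X_2,\mathcal{F}_2)$, respectively.
This can be implemented to run in time $O(|X_1 \cap X_2|\,|\mathcal{D}|)$ [Knu98].

This method for constructing the compatible database is shown as
Algorithm 7.2. The running time of the algorithm is linear in the
size of the input, i.e., in the sum of the cardinalities of the
transactions in the projections.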

The number of transaction databases compatible with the pro- 
jections pr{Xi,'D) and pr{X2,'D) of a given transaction database 
V can be computed from the counts count{X, pr{Xi fi X2,'D)), 
count (Yi, pr{X 1,1))) and count(Y2, pr{X2,T>)) for all X, Yi and 
Y2 such that X = Yin X2 = Y2n Xi, Yi C Xi, Y2 C X2, 
count{Yi,pr{Xi,V)) > and count {Y2, pr{X2,V)) > 0. 

The collection 

S = {X Q XinX2 : count{X,pr{Xi n X2,V)) > 0} 



partitions the transactions in pr{Xi,'D) and pr{X2,T>) into equiva- 
lence classes of transactions with the same projections to Xi fi X2. 
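Algorithm 7.2 An algorithm for constructing a transaction database
$\mathcal{D}$ compatible with projections $\mathit{pr}(\mathcal{I}_1, \mathcal{F}_1)$ and $\mathit{pr}(\mathcal{I}_2, \mathcal{F}_2)$.
Input: Projections $\mathit{pr}(\mathcal{I}_1, \mathcal{F}_1)$ and $\mathit{pr}(\mathcal{I}_2, \mathcal{F}_2)$.
Output: A transaction database $\mathcal{D}$ compatible with $\mathit{pr}(\mathcal{I}_1, \mathcal{F}_1)$
and $\mathit{pr}(\mathcal{I}_2, \mathcal{F}_2)$, or $\emptyset$ if such a database does not exist.
function FROM-TWO-TO-ONE($\mathit{pr}(\mathcal{I}_1, \mathcal{F}_1)$, $\mathit{pr}(\mathcal{I}_2, \mathcal{F}_2)$)
    $\mathcal{P}_1 \leftarrow \emptyset$
    for all $(i, Y) \in \mathit{pr}(\mathcal{I}_1, \mathcal{F}_1)$ do
        $X \leftarrow Y \cap \mathcal{I}_2$
        $\mathcal{P}_1 \leftarrow \mathcal{P}_1 \cup \{X\}$
        $\mathcal{S}^1_X \leftarrow \mathcal{S}^1_X \cup \{(i, Y)\}$
    end for
    $\mathcal{P}_2 \leftarrow \emptyset$
    for all $(j, Z) \in \mathit{pr}(\mathcal{I}_2, \mathcal{F}_2)$ do
        $X \leftarrow Z \cap \mathcal{I}_1$
        $\mathcal{P}_2 \leftarrow \mathcal{P}_2 \cup \{X\}$
        $\mathcal{S}^2_X \leftarrow \mathcal{S}^2_X \cup \{(j, Z)\}$
    end for
    if $\mathcal{P}_1 \neq \mathcal{P}_2$ then
        return $\emptyset$
    end if
    $\mathcal{D} \leftarrow \emptyset$
    for all $X \in \mathcal{P}_1$ do
        if $|\mathcal{S}^1_X| \neq |\mathcal{S}^2_X|$ then
            return $\emptyset$
        end if
        while $\mathcal{S}^1_X \neq \emptyset$ do
            Choose $(i, Y) \in \mathcal{S}^1_X$ and $(j, Z) \in \mathcal{S}^2_X$ arbitrarily.
            $\mathcal{D} \leftarrow \mathcal{D} \cup \{(|\mathcal{D}| + 1, Y \cup Z)\}$
            $\mathcal{S}^1_X \leftarrow \mathcal{S}^1_X \setminus \{(i, Y)\}$
            $\mathcal{S}^2_X \leftarrow \mathcal{S}^2_X \setminus \{(j, Z)\}$
        end while
    end for
    return $\mathcal{D}$
end function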
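
Algorithm 7.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `from_two_to_one`, the representation of a projection as a list of itemsets (with transaction identifiers dropped), and the use of `None` to signal that no compatible database exists are choices made for this example only.

```python
from collections import defaultdict


def from_two_to_one(proj1, items1, proj2, items2):
    """Merge two projections into one compatible database, or return None.

    proj1, proj2: iterables of itemsets, one per transaction (assumed format).
    items1, items2: the item domains of the two projections.
    """
    common = frozenset(items1) & frozenset(items2)

    # Group the transactions of each projection by their itemset on the
    # shared items; these groups play the role of the classes in the text.
    classes1 = defaultdict(list)
    for y in proj1:
        y = frozenset(y)
        classes1[y & common].append(y)
    classes2 = defaultdict(list)
    for z in proj2:
        z = frozenset(z)
        classes2[z & common].append(z)

    # Both projections must induce the same classes on the shared items.
    if classes1.keys() != classes2.keys():
        return None

    database = []
    for x, group1 in classes1.items():
        group2 = classes2[x]
        if len(group1) != len(group2):
            return None
        # Any pairing inside a class works; pair the transactions in order.
        database.extend(y | z for y, z in zip(group1, group2))
    return database
```

Returning `None` rather than an empty collection keeps the failure case distinguishable from a genuinely empty database. Each transaction is visited a constant number of times, which matches the linear running time stated above up to the cost of hashing the itemsets.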
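The partition can be further refined by the collections

$\mathcal{S}^1_X = \{ Y_1 \subseteq \mathcal{I}_1 : Y_1 \cap \mathcal{I}_2 = X, \mathit{count}(Y_1, \mathit{pr}(\mathcal{I}_1, \mathcal{D})) > 0 \}$

and

$\mathcal{S}^2_X = \{ Y_2 \subseteq \mathcal{I}_2 : Y_2 \cap \mathcal{I}_1 = X, \mathit{count}(Y_2, \mathit{pr}(\mathcal{I}_2, \mathcal{D})) > 0 \}.$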

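Using these collections, the number of compatible databases can
be computed as follows. The transaction identifiers can be
partitioned into the classes $X \in \mathcal{S}$ in

$c = \frac{|\mathcal{D}|!}{\prod_{X \in \mathcal{S}} \mathit{count}(X, \mathit{pr}(\mathcal{I}_1 \cap \mathcal{I}_2, \mathcal{D}))!}$

ways. In each class $X \in \mathcal{S}$, the transaction identifiers can be further
partitioned into classes $Y_1 \in \mathcal{S}^1_X$ in

$a_X = \frac{\mathit{count}(X, \mathit{pr}(\mathcal{I}_1 \cap \mathcal{I}_2, \mathcal{D}))!}{\prod_{Y_1 \in \mathcal{S}^1_X} \mathit{count}(Y_1, \mathit{pr}(\mathcal{I}_1, \mathcal{D}))!}$

ways. Now we have counted the number of different projections
$\mathit{pr}(\mathcal{I}_1, \mathcal{D})$. The number of different databases that can be obtained
by merging the transactions in $\mathit{pr}(\mathcal{I}_2, \mathcal{D})$ to the transactions of
$\mathit{pr}(\mathcal{I}_1, \mathcal{D})$ using the transaction identifiers of $\mathit{pr}(\mathcal{I}_1, \mathcal{D})$ is

$b_X = \frac{\mathit{count}(X, \mathit{pr}(\mathcal{I}_1 \cap \mathcal{I}_2, \mathcal{D}))!}{\prod_{Y_2 \in \mathcal{S}^2_X} \mathit{count}(Y_2, \mathit{pr}(\mathcal{I}_2, \mathcal{D}))!}$

Thus, the total number of transaction databases compatible with
$\mathit{pr}(\mathcal{I}_1, \mathcal{D})$ and $\mathit{pr}(\mathcal{I}_2, \mathcal{D})$ is $c \prod_{X \in \mathcal{S}} a_X b_X$. □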
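
The counting argument of the proof can be checked on small inputs with a short Python sketch. It assumes that each projection is given as a list of itemsets, one per transaction; the names `multinomial` and `count_compatible` are hypothetical and introduced only for this illustration. The function evaluates the product of the three kinds of multinomial coefficients derived above.

```python
from collections import Counter
from math import factorial


def multinomial(parts):
    """Ways to split sum(parts) labelled objects into groups of the given sizes."""
    result = factorial(sum(parts))
    for p in parts:
        result //= factorial(p)  # exact: every partial quotient is an integer
    return result


def count_compatible(proj1, items1, proj2, items2):
    """Number of databases (identifier -> itemset) compatible with both projections."""
    common = frozenset(items1) & frozenset(items2)
    counts1 = Counter(frozenset(y) for y in proj1)
    counts2 = Counter(frozenset(z) for z in proj2)

    # For every class X on the shared items, collect the counts of the
    # distinct itemsets of each projection that fall into that class.
    groups1, groups2 = {}, {}
    for y, k in counts1.items():
        groups1.setdefault(y & common, []).append(k)
    for z, k in counts2.items():
        groups2.setdefault(z & common, []).append(k)

    # The projections must agree on the shared items, class by class.
    if groups1.keys() != groups2.keys():
        return 0
    if any(sum(groups1[x]) != sum(groups2[x]) for x in groups1):
        return 0

    # Factor c: distribute the identifiers over the classes X.
    total = multinomial([sum(ks) for ks in groups1.values()])
    # Factors a_X and b_X: refine each class by the itemsets of each projection.
    for x in groups1:
        total *= multinomial(groups1[x]) * multinomial(groups2[x])
    return total
```

For the projections {a, c}, {c}, {a} over the items {a, c} and {c}, {b, c}, {b} over the items {b, c}, the three identifiers can be split into the classes {c} and the empty set in 3 ways, and the class {c} contributes 2 · 2 further choices, so the function returns 12.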

The practical relevance of this positive result (Theorem 7.5)
depends on how much the domains $\mathcal{I}_1$ and $\mathcal{I}_2$ overlap. If $|\mathcal{I}_1 \cap \mathcal{I}_2|$
is very small but $|\mathcal{I}_1 \cup \mathcal{I}_2|$ is large, then there is a great danger that
there are several compatible transaction databases. Fortunately, in
the case of two projections we are able to efficiently count the
number of compatible databases and thus to evaluate the usefulness of
the found database.

In the simplest case of the database reconstruction problem all
projections $\mathit{pr}(\mathcal{I}_1, \mathcal{F}_1), \ldots, \mathit{pr}(\mathcal{I}_m, \mathcal{F}_m)$ are disjoint, since in that
case any database with projections $\mathit{pr}(\mathcal{I}_1, \mathcal{F}_1), \ldots, \mathit{pr}(\mathcal{I}_m, \mathcal{F}_m)$ is
a compatible one. Unfortunately this also means that the number
of compatible databases is very large. Thus, one should probably
require something more than mere compatibility.

One natural restriction, applying Occam's razor, is to search
for the compatible database with the smallest number of
transactions with different itemsets. This kind of database is (in some
sense) the simplest hypothesis based on the downward closed
itemset collection. This can be beneficial both for analyzing the data
and for taking actions based on the database.

Unfortunately, it can be shown that finding the transaction
database with the smallest number of different transactions is NP-hard
already for two disjoint projections. We show the NP-hardness by
a reduction from the 3-partition problem:

Problem 7.4 (3-partition [GJ79]). Given a set $A$ of $3l$ elements,
a bound $B \in \mathbb{N}$, and a size $s(a) \in \mathbb{N}$ for each $a \in A$ such that
$B/4 < s(a) < B/2$ and such that $\sum_{a \in A} s(a) = lB$, decide whether
or not $A$ can be partitioned into $l$ disjoint sets $A_1, \ldots, A_l$ such that
$\sum_{a \in A_i} s(a) = B$ for all $1 \le i \le l$.

Theorem 7.6. It is NP-hard to find a transaction database
consisting of the smallest number of different transactions and being
compatible with the projections $\mathit{pr}(\mathcal{I}_1, \mathcal{F})$ and $\mathit{pr}(\mathcal{I}_2, \mathcal{F})$ such that
$\mathcal{I}_1 \cap \mathcal{I}_2 = \emptyset$.

Proof. We show the NP-hardness of the problem by reduction from
the 3-partition problem (Problem 7.4).

As Problem 7.4 is known to be strongly NP-complete, we can
assume that the sizes $s(a)$ of all elements $a \in A$ are bounded above
by a polynomial in $l$.

The instance $(A, B, s)$ of 3-partition can be encoded as two
projections as follows. Without loss of generality, let the elements of
$A$ be $1, \ldots, 3l$. Then

$\mathcal{I}_1 = \{ 1, \ldots, \lceil \log 3l \rceil \}$

and

$\mathcal{I}_2 = \{ \lceil \log 3l \rceil + 1, \ldots, \lceil \log 3l \rceil + \lceil \log l \rceil \}.$

Again, let us denote by $\mathit{bin}(x)$ the binary coding of $x \in \mathbb{N}$ as a set
consisting of the positions of ones in the binary code. Then projection
$\mathit{pr}(\mathcal{I}_1, \mathcal{F})$ consists of $s(a)$ transactions consisting of the itemset
$\mathit{bin}(a) \subseteq \mathcal{I}_1$ for each $a \in A$. Projection $\mathit{pr}(\mathcal{I}_2, \mathcal{F})$ consists of $B$
transactions consisting of the itemset $\mathit{bin}(b) + \lceil \log 3l \rceil \subseteq \mathcal{I}_2$ for each $b \in \{1, \ldots, l\}$.

Clearly there is a 3-partition for $(A, B, s)$ if and only if there is
a database $\mathcal{D}$ with $3l$ different transactions that is compatible with
projections $\mathit{pr}(\mathcal{I}_1, \mathcal{F})$ and $\mathit{pr}(\mathcal{I}_2, \mathcal{F})$. □
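
The encoding used in the reduction can be illustrated with a small Python sketch. The function names `bin_positions`, `encode_three_partition` and `database_from_partition` are hypothetical. As a further assumption, the sketch takes the bit widths from `int.bit_length()`, so that every number is guaranteed to fit; this may differ by one from the ceilings used in the text.

```python
def bin_positions(x, offset=0):
    """Positions (1-based, shifted by offset) of the ones in the binary code of x."""
    return frozenset(offset + i + 1 for i in range(x.bit_length()) if (x >> i) & 1)


def encode_three_partition(sizes, bound):
    """Encode a 3-partition instance as two disjoint projections.

    sizes[a - 1] is the size s(a) of element a; bound is B.
    Returns two lists of itemsets, one itemset per transaction.
    """
    n = len(sizes)
    if n % 3 != 0 or sum(sizes) != (n // 3) * bound:
        raise ValueError("not a well-formed 3-partition instance")
    if any(not (bound < 4 * s and 2 * s < bound) for s in sizes):
        raise ValueError("sizes must satisfy B/4 < s(a) < B/2")
    width1 = n.bit_length()  # assumption: enough bits for the elements 1..3l
    # s(a) copies of bin(a) for every element a.
    proj1 = [bin_positions(a) for a, s in enumerate(sizes, start=1) for _ in range(s)]
    # B copies of the shifted bin(b) for every block b.
    proj2 = [bin_positions(b, offset=width1)
             for b in range(1, n // 3 + 1) for _ in range(bound)]
    return proj1, proj2


def database_from_partition(sizes, groups):
    """Build the database induced by a 3-partition.

    groups[b - 1] lists the elements assigned to block b.
    """
    width1 = len(sizes).bit_length()
    database = []
    for b, group in enumerate(groups, start=1):
        for a in group:
            itemset = bin_positions(a) | bin_positions(b, offset=width1)
            database.extend([itemset] * sizes[a - 1])
    return database
```

For example, the instance with sizes 4, 4, 5, 4, 4, 5 and bound 13 has the 3-partition {1, 2, 3}, {4, 5, 6}. The database built from it pairs every element with exactly one block, so it contains only six different transactions, one per element.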

Finally, let us note that if the number of items is fixed, then a
compatible transaction database can be found in time polynomial
in the number of transactions in the projections: finding a
transaction database compatible with the projections can be formulated
as a linear integer programming task where the variables are the
possible different itemsets in the transactions. The number of
possible different itemsets is $2^{|\mathcal{I}|}$. Linear integer programming tasks
with a fixed number of variables can be solved in time polynomial
in the size of the linear equations [LJ83].
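
One possible way to write this integer program is sketched below. It assumes that each projection $\mathit{pr}(\mathcal{I}_i, \mathcal{F}_i)$ is summarized by the counts of its itemsets. The variable $n_Z$, the number of transactions with itemset $Z$ in the sought database, is a symbol introduced here for illustration and does not appear in the text.

```latex
\begin{align*}
& \sum_{Z \subseteq \mathcal{I},\; Z \cap \mathcal{I}_i = Y} n_Z
  = \mathit{count}(Y, \mathit{pr}(\mathcal{I}_i, \mathcal{F}_i))
  && \text{for all } i \in \{1, \ldots, m\} \text{ and } Y \subseteq \mathcal{I}_i, \\
& n_Z \in \{0, 1, 2, \ldots\}
  && \text{for all } Z \subseteq \mathcal{I}.
\end{align*}
```

Under this formulation any feasible solution describes a compatible database that contains $n_Z$ copies of each itemset $Z$, and the number of variables, $2^{|\mathcal{I}|}$, is a constant when the number of items is fixed.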
Conclusions



Pattern discovery is an important subfield of data mining that
attempts to discover interesting (or high-quality) patterns from data.
There are several efficient techniques to discover such patterns with
respect to different interestingness measures. Merely discovering
the patterns efficiently is rarely the ultimate goal; rather, the patterns
are discovered for some purpose. One important use of patterns is
to summarize data, since the pattern collections together with the
quality values of the patterns can be considered summaries of the
data.

In this dissertation we have studied how the pattern collections 
could be summarized. Our approach has been five-fold. 

First, we studied how to cast views to pattern collections by 
simplifying the quality values of the patterns. In particular, we 
gave efficient algorithms for optimally discretizing the quality val- 
ues. Furthermore, we described how the discretizations can be used 
in conjunction with pruning of redundant patterns to simplify the 
pattern collections. 

Second, continuing with the theme of simplifying pattern
collections, we considered the trade-offs between the understandability
and the accuracy of the pattern collections and their quality values.
As a solution that supports exploratory data analysis, we proposed
the pattern orderings. A pattern ordering of a pattern collection
lists the patterns in such an order that each pattern improves our
estimate about the whole pattern collection as much as possible (with
respect to a given loss function and estimation method).
Furthermore, we showed that under certain reasonable assumptions each
length-$k$ prefix of the pattern ordering provides a $k$-subcollection
of patterns that is almost as good a description of the whole pattern
collection as the best $k$-subcollection. We illustrated the
applicability of pattern orderings in approximating pattern collections and
data.

Third, we examined how the structural properties (especially 
partial orders) of the pattern collections can be exploited to obtain 
clusterings of the patterns and more concise descriptions of the 
pattern collections. The same techniques can also be used to simplify
transaction databases.

Fourth, we proposed a generalization of association rules: change 
profiles. A change profile of a pattern describes how the quality 
value of the pattern has to be changed to obtain the quality values 
of neighboring patterns. The change profiles can be used to com- 
pare patterns with each other: patterns can be considered similar, 
if their change profiles are similar. We studied the computational 
complexity of clustering patterns based on their change profiles. 
The problem turned out to be quite difficult if some approxima- 
tion quality requirements are given. This does not rule out the use 
of heuristic clustering methods or hierarchical clustering. We illus- 
trated the hierarchical clusterings of change profiles using real data. 
In addition to clustering change profiles, we considered frequency 
estimation from approximate change profiles that could be used as 
building blocks of condensed representations of pattern collections. 
We provided efficient algorithms for the frequency estimation from 
the change profiles and evaluated empirically the noise tolerance of 
the methods. 

Fifth, we studied the problem of inverse pattern discovery, i.e.,
the problem of constructing data sets that could have induced the
given patterns and their quality values. More specifically, we
studied the computational complexity of inverse frequent itemset
mining. We showed that the problem of finding a transaction database
compatible with a given collection of frequent itemsets and their
supports is NP-hard in general, but some of its special cases are
solvable in polynomial time.

Although the problems studied in this dissertation are different,
they also have many similarities. Frequency simplifications,
pattern orderings, pattern chains and change profiles are all techniques
for summarizing pattern collections. Frequency simplifications and
pattern orderings provide primarily approximations of the pattern
collections, whereas pattern chains and change profiles describe the
pattern collection by slightly more complex patterns obtained by
combining the patterns of the underlying pattern collection.

There are also many other ways to group the techniques. For ex- 
ample, the following similarities and dissimilarities can be observed: 

• Pattern orderings, pattern chains and change profiles make 
use of the relationships between the patterns directly, whereas 
frequency simplifications do not depend on the actual pat- 
terns. 

• Frequency simplifications, pattern orderings and change pro- 
files can be used to obtain an approximate description of the 
pattern collection, whereas pattern chains provide an exact 
description. 

• Frequency simplifications, pattern orderings and pattern chains 
describe the quality values of the patterns, whereas change 
profiles describe the changes in the quality values. 

• Frequency simplifications, pattern chains and change profiles 
can be used to cluster the patterns, whereas the interpretation 
of pattern orderings as clusterings is not so straightforward. 

Inverse pattern discovery also has similarities with the other
problems, as all the problems are related to the problem of evaluating
the quality of the pattern collection. Furthermore, all problems
are closely related to the two high-level themes of the dissertation,
namely post-processing and condensed representations of pattern
collections.

As future work, exploring the possibilities and limitations of
condensed representations of pattern collections is likely to be
continued. One especially interesting question is how the pattern
collections should actually be represented. Some suggestions are provided
in [Mie04c, Mie05a, Mie05b]. Also, measuring the complexity of the
data and its relationships to condensed representations seems to be
an important and promising research topic.

As data mining is an inherently exploratory process often involving
huge data sets, a proper data management infrastructure seems to
be necessary. A promising model for that, and for data mining as
a whole, is offered by inductive databases [MLK04]. There are many
interesting questions related to inductive databases. For example,
it is not completely clear what inductive databases are or what they
should be [Mie04a].

Recently, the privacy issues of data mining have also been
recognized to be of high importance [Pin02, VBF+04]. There are two
very important topics in privacy preserving data mining. First,
sometimes no one has access to the whole data but still the data
owners are interested in mining the data. There have already been
many proposals for secure computation of many data mining
results, for example frequent itemsets [ESAG02, FNP04, GLLM05,
VC02]. Second, in addition to computing the data mining results
securely, it is often very important that the data mining results
themselves are secure, i.e., that they do not leak any sensitive
information about the data [FJ02, Mie04b, OZS04, SVC01, VEB+04].

Another important problem related to inductive databases is
finding the underlying general principles of pattern discovery [MT97].
There are many pattern discovery algorithms, but it is still largely
open what the essential differences between the methods are and
how to choose the technique for some particular pattern discovery
task. Some preliminary evaluation of the techniques in the case
of frequent itemset mining has recently been done [GZ03], but the
issues of a general theory of pattern discovery are still largely open.